Solr 5: hit highlight with NGram/EdgeNgram-fields

2015-04-20 Thread Bjørn Hjelle
With Solr 4.10.3 I was advised to set luceneMatchVersion to 4.3 to make
hit highlighting work with NGram/EdgeNGram fields, like this:

<filter class="solr.EdgeNGramFilterFactory" maxGramSize="20"
        minGramSize="1" luceneMatchVersion="4.3"/>

In Solr 5 and 5.1 this no longer seems to work.
The complete word is highlighted, not just the part that matches the
search term.

On the Solr admin analysis page, the end offsets are again not what I
expect. This is what it shows:

LENGTF
text            t     te       tes         test
raw_bytes       [74]  [74 65]  [74 65 73]  [74 65 73 74]
start           0     0        0           0
end             4     4        4           4
positionLength  1     1        1           1
type            word  word     word        word
position        1     1        1           1

In Solr 4.10.3 with luceneMatchVersion set to 4.3, the end offsets would be 1,
2, 3, 4 and hit highlighting would work.

Any advice on making hit highlighting work with (Edge)NGram fields would be
highly appreciated!

Thanks,
Bjørn
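
A minimal Python sketch of why the end offsets matter, assuming a
highlighter that simply wraps the start/end span recorded for the matching
gram. This is a simplified model, not Solr's highlighter code; the helper
names edge_ngrams and highlight are hypothetical.

```python
def edge_ngrams(token, start, min_gram=1, max_gram=20, per_gram_offsets=True):
    """Return (gram, start_offset, end_offset) for the front edge n-grams of one token."""
    grams = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        # Either each gram ends where its own text ends, or every gram
        # inherits the end offset of the whole original token.
        end = start + size if per_gram_offsets else start + len(token)
        grams.append((token[:size], start, end))
    return grams


def highlight(text, grams, query, pre="<em>", post="</em>"):
    """Wrap the span recorded for the gram that equals the query term."""
    for gram, start, end in grams:
        if gram == query:
            return text[:start] + pre + text[start:end] + post + text[end:]
    return text


text = "test"
per_gram = edge_ngrams(text, 0, per_gram_offsets=True)      # end offsets 1, 2, 3, 4
whole_token = edge_ngrams(text, 0, per_gram_offsets=False)  # end offset 4 for every gram

print(highlight(text, per_gram, "te"))     # <em>te</em>st
print(highlight(text, whole_token, "te"))  # <em>test</em>
```

With per-gram end offsets only the matching prefix is wrapped; when every
gram carries the end offset of the full token, the whole word is wrapped,
which matches the behaviour described above.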


Combination of edgengram and ngram

2011-12-13 Thread Shawn Heisey
I am interested in a new filter type, one that would combine edgengram 
and ngram.  The idea is that it would create all ngrams specified by the 
min/max size, but the ngrams that happen to be edgengrams (specifically 
the left side) would get an index-time boost.  Optionally the boost 
would be higher if it came from the first token.
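
The idea can be modelled in a few lines of Python. This is only a sketch of
the proposed behaviour, not a Lucene TokenFilter; the function name and the
boost values are hypothetical, and whitespace splitting stands in for a real
tokenizer.

```python
def ngrams_with_boost(phrase, min_gram=2, max_gram=3,
                      edge_boost=2.0, first_token_boost=3.0):
    """Return (gram, boost) pairs: all n-grams, with left-edge grams boosted."""
    result = []
    for index, token in enumerate(phrase.lower().split()):
        for size in range(min_gram, min(max_gram, len(token)) + 1):
            for start in range(len(token) - size + 1):
                gram = token[start:start + size]
                if start == 0:
                    # Left-edge gram: boost it, more so for the first token.
                    boost = first_token_boost if index == 0 else edge_boost
                else:
                    boost = 1.0
                result.append((gram, boost))
    return result


for gram, boost in ngrams_with_boost("red car"):
    print(gram, boost)
```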


The use case:  An automatic autosuggest dropdown that populates as a 
user types into a search box.  The index would have one field and it 
would be built from a manually produced list of suggested search 
phrases.  The boosts mentioned would make it so that matches from the 
beginning of a word, and especially from the beginning of the entire 
suggested phrase, would be returned first.


I could get a similar effect by using a copyfield, analyzing one field 
with ngrams and the other with edgengrams, then using edismax to put a 
boost on the edge version.  I will start with this method, but using 
copyfield makes the index bigger, and using dismax makes the ultimate 
parsed queries more complicated.
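
A sketch of the request parameters for the copyField approach, assuming two
hypothetical fields, suggest_edge (edge n-grams) and suggest_ngram (plain
n-grams), populated from the same source via copyField. defType, q and qf are
standard edismax parameters; the field names and the boost value are
illustrative only.

```python
from urllib.parse import urlencode


def suggest_params(user_input, edge_field="suggest_edge",
                   ngram_field="suggest_ngram", edge_boost=5):
    """Build edismax parameters that rank edge-gram matches above inner matches."""
    return {
        "defType": "edismax",
        "q": user_input,
        # field^boost: matches in the edge field count more than in the ngram field
        "qf": f"{edge_field}^{edge_boost} {ngram_field}",
    }


params = suggest_params("tr")
print(urlencode(params))  # query string to append to the select handler URL
```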


If I can avoid the copyfield, the index will be smaller and the queries 
very simple, which should make for very high speed.


I will take a look at the source code, but I'm a bit of a Java novice.  
Does anyone have the knowledge, desire, and time to crank this one out 
quickly?  Is it possible someone has already written such a filter?


Thanks,
Shawn



Problem using EdgeNGram

2011-09-21 Thread Kissue Kissue
Hi,

I am using Solr 3.3 with SolrJ. I am trying to use EdgeNGram to power an
auto-suggest feature in my application. My understanding is that using
EdgeNGram would mean that results are only returned for records starting with
the search criteria, but this is not happening for me.

For example, if I search for "tr", I get the following results:

Greenham Trading 6
IT Training Publications
AA Training

Below are details of my configuration:

<fieldType name="edgytext" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
            maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="businessName" type="edgytext" indexed="true" stored="true"
       required="true" omitNorms="true" omitTermFreqAndPositions="true"/>

Any ideas why this is happening will be much appreciated.

Thanks.


Re: Problem using EdgeNGram

2011-09-21 Thread O. Klein
Try using KeywordTokenizerFactory instead of StandardTokenizerFactory to get
the results you want.
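
A rough Python model of the difference, assuming whitespace splitting as a
stand-in for StandardTokenizerFactory (the real tokenizer is more involved).
The helper names are hypothetical, and "Training Direct" is a made-up record
added only to show a name that starts with the search term.

```python
def edge_ngrams(token, min_gram=1, max_gram=15):
    """Front edge n-grams of a single token."""
    return {token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)}


def index_terms(value, per_word):
    """Terms indexed for one field value.

    per_word=True  ~ tokenize into words, then edge n-gram each word
    per_word=False ~ keep the whole value as one token (keyword-style)
    """
    lowered = value.lower()
    tokens = lowered.split() if per_word else [lowered]
    terms = set()
    for token in tokens:
        terms |= edge_ngrams(token)
    return terms


names = ["Greenham Trading 6", "IT Training Publications",
         "AA Training", "Training Direct"]

word_hits = [n for n in names if "tr" in index_terms(n, per_word=True)]
keyword_hits = [n for n in names if "tr" in index_terms(n, per_word=False)]

print(word_hits)     # every name containing a word that starts with "tr"
print(keyword_hits)  # only names whose full value starts with "tr"
```

Tokenizing into words first means every word contributes its own edge
n-grams, so "tr" matches "Trading" and "Training" anywhere in the name.
Treating the whole value as a single token restricts matches to the start of
the field.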





Re: Edgengram

2011-06-01 Thread Brian Lamb
Hi Tomás,

Thank you very much for your suggestion. I took another crack at it using
your recommendation and it worked ideally. The only thing I had to change
was

<analyzer type="query">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>

to

<analyzer type="query">
  <tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>

The first did not produce any results but the second worked beautifully.

Thanks!

Brian Lamb




Re: Edgengram

2011-06-01 Thread Erick Erickson
Be a little careful here. LowerCaseTokenizerFactory is different than
KeywordTokenizerFactory.

LowerCaseTokenizerFactory will give you more than one term. For example,
the string "Intelligence can't be MeaSurEd" will give you 5 terms,
any of which may match:
intelligence, can, t, be, measured
whereas KeywordTokenizerFactory followed by, say, LowerCaseFilter
would give you exactly one token:
intelligence can't be measured

So searching for "measured" would get a hit in the first case but
not in the second. Searching for intellig* would hit both.

Neither is better, just make sure they do what you want!

This page will help a lot:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseTokenizerFactory
as will the admin/analysis page.

Best
Erick
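
The difference can be reproduced with a small Python model. This is an
approximation for illustration, not the Lucene implementation: the first
function splits at non-letters and lowercases, the second keeps the whole
input as one lowercased token. Both function names are hypothetical.

```python
import re


def lower_case_tokenizer(text):
    """Approximate LowerCaseTokenizerFactory: runs of letters, lowercased."""
    return [match.lower() for match in re.findall(r"[^\W\d_]+", text)]


def keyword_then_lowercase(text):
    """Approximate KeywordTokenizerFactory followed by LowerCaseFilter: one token."""
    return [text.lower()]


text = "Intelligence can't be MeaSurEd"
many = lower_case_tokenizer(text)
one = keyword_then_lowercase(text)

print(many)  # ['intelligence', 'can', 't', 'be', 'measured']
print(one)   # ["intelligence can't be measured"]

# Exact term lookup for "measured": only the multi-term analysis has it.
print("measured" in many, "measured" in one)

# Prefix lookup for intellig*: both analyses have a term starting with it.
print(any(t.startswith("intellig") for t in many),
      any(t.startswith("intellig") for t in one))
```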


Re: Edgengram

2011-06-01 Thread Brian Lamb
I think in my case LowerCaseTokenizerFactory will be sufficient because
there will never be spaces in this particular field. But thank you for the
useful link!

Thanks,

Brian Lamb


Re: Edgengram

2011-05-31 Thread Erick Erickson
That'll work for your case, although be aware that string types aren't
analyzed at all, so case matters, as do spaces etc.

What is the use case here? If you explain it a bit, there might be
better answers.

Best
Erick
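
A tiny Python model of that caveat, assuming a wildcard query such as
abcdefg* behaves like a plain prefix test against the stored value of an
unanalyzed string field. The function name is hypothetical.

```python
def prefix_match(stored_value, prefix):
    """Model a prefix (wildcard) query against an unanalyzed string field."""
    # No lowercasing and no trimming: the value is compared exactly as stored.
    return stored_value.startswith(prefix)


print(prefix_match("abcdefghijklmnop", "abcdefg"))   # True
print(prefix_match("Abcdefghijklmnop", "abcdefg"))   # False: case differs
print(prefix_match(" abcdefghijklmnop", "abcdefg"))  # False: leading space is part of the value
```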

On Fri, May 27, 2011 at 9:17 AM, Brian Lamb
brian.l...@journalexperts.com wrote:
> For this, I ended up just changing it to string and using abcdefg* to
> match. That seems to work so far.
>
> Thanks,
>
> Brian Lamb
>
> On Wed, May 25, 2011 at 4:53 PM, Brian Lamb
> brian.l...@journalexperts.com wrote:
>
>> Hi all,
>>
>> I'm running into some confusion with the way edgengram works. I have the
>> field set up as:
>>
>> <fieldType name="edgengram" class="solr.TextField"
>>            positionIncrementGap="1000">
>>   <analyzer>
>>     <tokenizer class="solr.LowerCaseTokenizerFactory"/>
>>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
>>             maxGramSize="100" side="front"/>
>>   </analyzer>
>> </fieldType>
>>
>> I've also set up my own similarity class that returns 1 as the idf score.
>> What I've found this does is if I match a string abcdefg against a field
>> containing abcdefghijklmnop, then the idf will score that as a 7:
>>
>> 7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2 abcdefg=2)
>>
>> I get why that's happening, but is there a way to avoid that? Do I need to
>> do a new field type to achieve the desired effect?
>>
>> Thanks,
>>
>> Brian Lamb




Re: Edgengram

2011-05-31 Thread Brian Lamb
In this particular case, I will be doing a Solr search based on user
preferences. So I will not be depending on the user to type "abcdefg". That
will be automatically generated based on user selections.

The contents of the field do not contain spaces, and since I am creating the
search parameters, case isn't important either.

Thanks,

Brian Lamb




Re: Edgengram

2011-05-31 Thread bmdakshinamur...@gmail.com
Can you specify the analyzer you are using for your queries?

Maybe you could use a KeywordAnalyzer for your queries so you don't end up
matching parts of your query.
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
This should help you.





-- 
Thanks and Regards,
DakshinaMurthy BM


Re: Edgengram

2011-05-31 Thread Brian Lamb
<fieldType name="edgengram" class="solr.TextField"
           positionIncrementGap="1000">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
            maxGramSize="25" side="front"/>
  </analyzer>
</fieldType>

I believe I used that link when I initially set up the field and it worked
great (and I'm still using it in other places). In this particular example,
however, it does not appear to be practical for me. I mentioned that I have a
similarity class that returns 1 for the idf, and in the case of an edgengram
it returns 1 * length of the search string.

Thanks,

Brian Lamb




Re: Edgengram

2011-05-31 Thread Tomás Fernández Löbbe
Hi Brian, I don't know if I understand what you are trying to achieve. You
want the term query abcdefg to have an idf of 1 instead of 7? I think using
the KeywordTokenizerFactory at query time should work. It would be
something like:

<fieldType name="edgengram" class="solr.TextField"
           positionIncrementGap="1000">
  <analyzer type="index">
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
            maxGramSize="25" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

This way, at query time abcdefg won't be turned into a, ab, abc, abcd, abcde,
abcdef, abcdefg. At index time it will.

Regards,
Tomás
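
A small Python model of the effect, assuming the custom similarity returns 1
per term and the per-term values are summed, which is consistent with the 7.0
reported in the original question. The helper names are hypothetical; this is
not Lucene's scoring code.

```python
def edge_ngrams(token, min_gram=1, max_gram=25):
    """Front edge n-grams of one token, shortest first."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


def query_terms(query, ngram_at_query_time):
    """Terms produced by the query analyzer for a single-word query."""
    token = query.lower()
    return edge_ngrams(token) if ngram_at_query_time else [token]


def summed_idf(terms, idf=lambda term: 1.0):
    """Sum a per-term idf; the default mimics a similarity that always returns 1."""
    return sum(idf(term) for term in terms)


indexed = set(edge_ngrams("abcdefghijklmnop"))  # terms stored at index time

shared_analyzer = query_terms("abcdefg", ngram_at_query_time=True)
separate_analyzer = query_terms("abcdefg", ngram_at_query_time=False)

print(summed_idf(shared_analyzer))    # 7.0: one contribution per generated gram
print(summed_idf(separate_analyzer))  # 1.0: the query stays a single term
print("abcdefg" in indexed)           # True: the single term still matches
```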





Re: Edgengram

2011-05-31 Thread Tomás Fernández Löbbe
...or also use the LowerCaseTokenizerFactory at query time for consistency,
but not the edge ngram filter.

 against a
field
 containing abcdefghijklmnop, then the idf will score that as a
 7:

 7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2
  abcdefg=2)

 I get why that's happening, but is there a way to avoid that? Do
 I
   need
to
 do a new field type to achieve the desired affect?

 Thanks,

 Brian Lamb


   
  
 
 
 
  --
  Thanks and Regards,
  DakshinaMurthy BM
 





Re: Edgengram

2011-05-27 Thread Brian Lamb
For this, I ended up just changing it to string and using abcdefg* to
match. That seems to work so far.

Thanks,

Brian Lamb

On Wed, May 25, 2011 at 4:53 PM, Brian Lamb
brian.l...@journalexperts.com wrote:

 Hi all,

 I'm running into some confusion with the way edgengram works. I have the
 field set up as:

 fieldType name=edgengram class=solr.TextField
 positionIncrementGap=1000
analyzer
  tokenizer class=solr.LowerCaseTokenizerFactory /
filter class=solr.EdgeNGramFilterFactory minGramSize=1
 maxGramSize=100 side=front /
/analyzer
 /fieldType

 I've also set up my own similarity class that returns 1 as the idf score.
 What I've found this does is if I match a string abcdefg against a field
 containing abcdefghijklmnop, then the idf will score that as a 7:

 7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2 abcdefg=2)

 I get why that's happening, but is there a way to avoid that? Do I need to
 do a new field type to achieve the desired affect?

 Thanks,

 Brian Lamb



Edgengram

2011-05-25 Thread Brian Lamb
Hi all,

I'm running into some confusion with the way edgengram works. I have the
field set up as:

<fieldType name="edgengram" class="solr.TextField"
           positionIncrementGap="1000">
   <analyzer>
     <tokenizer class="solr.LowerCaseTokenizerFactory" />
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
             maxGramSize="100" side="front" />
   </analyzer>
</fieldType>

I've also set up my own similarity class that returns 1 as the idf score.
What I've found is that if I match the string abcdefg against a field
containing abcdefghijklmnop, the idf is scored as 7:

7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2 abcdefg=2)

I get why that's happening, but is there a way to avoid it? Do I need to
create a new field type to achieve the desired effect?

Thanks,

Brian Lamb
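
For illustration, the effect described above can be reproduced outside Solr.
The sketch below is a minimal Python model, not Solr's implementation:
edge_ngrams and summed_idf are hypothetical helpers, and the constant idf of 1
per term stands in for the custom similarity class. It assumes the query goes
through the same n-gram analysis as the index, so the query abcdefg expands to
seven terms.

```python
def edge_ngrams(token, min_gram=1, max_gram=100):
    """Return the front-side edge n-grams of a token (hypothetical helper)."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def summed_idf(query, idf_per_term=1.0):
    """Sum a constant idf over every term the query expands to."""
    return sum(idf_per_term for _ in edge_ngrams(query.lower()))


query_terms = edge_ngrams("abcdefg")
print(query_terms)            # ['a', 'ab', 'abc', 'abcd', 'abcde', 'abcdef', 'abcdefg']
print(summed_idf("abcdefg"))  # 7.0
```

If the query were left as a single term instead of being expanded, the same
sum would be 1, which is the idea behind using a different analyzer at query
time.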


LetterTokenizer + EdgeNGram + apostrophe in query = invalid result

2011-02-25 Thread Matt Weber
I have the following field defined in my schema:

  <fieldType name="ngram" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.LetterTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
              maxGramSize="25" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.LetterTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="person" type="ngram" indexed="true" stored="true" />

I have the default field set to person and have indexed the
following document:

<add>
   <doc>
   <field name="id"><![CDATA[1001116609]]></field>
   <field name="person"><![CDATA[Vincent M D'Onofrio]]></field>
   </doc>
</add>


The following queries return the result as expected using the standard
request handler:

vincent m d onofrio
d'o
onofrio
d onofrio

The following query fails:

d'onofrio

This is weird because d'o returns a result.  As soon as I type the
n I start to get no results.  I ran this through the field analysis
page and it shows that this query is being tokenized correctly and
should yield a result.

I am using a build of trunk Solr (r1073990) and the example
solrconfig.xml.  I am also using the example schema with the addition
of my ngram field.

Any ideas?  I have tried this with other words containing an
apostrophe and they all stop returning results after 4 characters.


Thanks,
Matt Weber
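
To make the analysis described above concrete, here is a rough Python model of
the two analyzer chains. It is only a sketch under stated assumptions:
letter_tokenize treats only ASCII letters as token characters (the real
tokenizer works on Unicode letters), and all function names are hypothetical.
It shows which terms the index and query sides would produce, nothing more.

```python
import re


def letter_tokenize(text):
    """Split on every non-letter character (ASCII-only simplification)."""
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]


def edge_ngrams(token, min_gram=1, max_gram=25):
    """Front-side edge n-grams of a single token."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_terms(text):
    """Index side: letter tokens, lowercased, expanded to edge n-grams."""
    terms = set()
    for token in letter_tokenize(text):
        terms.update(edge_ngrams(token.lower()))
    return terms


def query_terms(text):
    """Query side: letter tokens, lowercased, no n-grams."""
    return [t.lower() for t in letter_tokenize(text)]


indexed = index_terms("Vincent M D'Onofrio")
for q in ["d'o", "d'onofrio"]:
    terms = query_terms(q)
    print(q, terms, all(t in indexed for t in terms))
```

In this model both queries split into two terms that are all present in the
index, which agrees with what the analysis page reports. Term presence alone
therefore does not explain the failing query; this sketch does not model how
the query parser combines the two terms.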


Re: EdgeNgram Auto suggest - doubles ignore

2011-02-08 Thread johnnyisrael

Hi Erick,

If you have time, Can you please take a look and provide your comments (or)
suggestions for this problem?

Please let me know if you need any more information.

Thanks,

Johnny
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2451828.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: EdgeNgram Auto suggest - doubles ignore

2011-02-08 Thread Erick Erickson
I'm afraid I'll have to pass, I'm absolutely swamped at the moment. Perhaps
someone else can pick it up.

I will say that you should be getting terms back when you pre-lower-case
them, so look in your index via the admin page or Luke to see whether what's
really in the name field is what you think it is.

As for sorting, I haven't a clue. Start by backing out your custom sorting,
verify that things are as you expect for everything *except* sorting, and
then add it back in.

Best
Erick



On Tue, Feb 8, 2011 at 10:11 AM, johnnyisrael johnnyi.john...@gmail.com wrote:


 Hi Erick,

 If you have time, Can you please take a look and provide your comments (or)
 suggestions for this problem?

 Please let me know if you need any more information.

 Thanks,

 Johnny
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2451828.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: EdgeNgram Auto suggest - doubles ignore

2011-02-01 Thread johnnyisrael

Hi Erick,

I tried to use the terms component and ended up with the following problems.

Problem: 1

Custom Sort not working in terms component:

http://lucene.472066.n3.nabble.com/Term-component-sort-is-not-working-td1905059.html#a1909386

I want to sort using one of my custom fields [value_score]. I have already
set it in my configuration, but it is not sorting properly.

The following is the configuration in solrconfig.xml:

  <searchComponent name="termsComponent"
      class="org.apache.solr.handler.component.TermsComponent"/>

  <requestHandler name="/terms"
      class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
      <bool name="terms">true</bool>
      <str name="wt">json</str>
      <str name="fl">name</str>
      <str name="sort">value_score desc</str>
      <str name="indent">true</str>
    </lst>
    <arr name="components">
      <str>termsComponent</str>
    </arr>
  </requestHandler>

The Solr response is not returned in the order given by the sort parameter.

Problem: 2

Case sensitivity problem: [I am searching for Apple]

http://localhost/solr/core1/terms?terms.fl=name&terms.prefix=apple -- not
working

http://localhost/solr/core1/terms?terms.fl=name&terms.prefix=Apple --
working

I tried a regex to overcome the case sensitivity problem:

http://localhost/solr/core1/terms?terms.fl=name&terms.regex=Apple&terms.regex.flag=case_insensitive

Will this regex-based search help with my requirement?

It is returning irrelevant results. I am using the same syntax as
mentioned in the wiki.

http://wiki.apache.org/solr/TermsComponent

Am I going wrong anywhere?

Please let me know if you need any more info.

Thanks,

Johnny
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2399330.html
Sent from the Solr - User mailing list archive at Nabble.com.
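
The terms requests in the message above pass several separate query-string
parameters (terms.fl, terms.prefix, terms.regex, terms.regex.flag), which have
to be joined with &. A minimal sketch of building such a URL with Python's
standard library follows; terms_url is a hypothetical helper, and the base URL
is simply the one used in the message.

```python
from urllib.parse import urlencode


def terms_url(base, field, prefix=None, regex=None, case_insensitive=False):
    """Build a terms request URL from its individual parameters."""
    params = [("terms.fl", field)]
    if prefix is not None:
        params.append(("terms.prefix", prefix))
    if regex is not None:
        params.append(("terms.regex", regex))
        if case_insensitive:
            params.append(("terms.regex.flag", "case_insensitive"))
    # urlencode joins the pairs with "&" and escapes unsafe characters
    return base + "?" + urlencode(params)


base = "http://localhost/solr/core1/terms"
print(terms_url(base, "name", prefix="apple"))
print(terms_url(base, "name", regex="Apple", case_insensitive=True))
```

Building the URL from pairs avoids hand-concatenation mistakes and takes care
of escaping if a prefix or regex contains characters such as spaces.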


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread johnnyisrael

Hi Eric,

You are right, there is a copy field to EdgeNgram. I tried the configuration
but it is not working as expected.

Configuration I tried:



<fieldType name="query" class="solr.TextField" positionIncrementGap="100"
           termVectors="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="edgytext" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3"
            maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="user_query" type="query" indexed="true" stored="true"
       omitNorms="true" omitTermFreqAndPositions="true" />
<field name="edgy_user_query" type="edgytext" indexed="true" stored="true"
       omitNorms="true" omitTermFreqAndPositions="true" />

<defaultSearchField>edgy_user_query</defaultSearchField>
<copyField source="user_query" dest="edgy_user_query"/>

==

When I search for the term apple, it returns results for "pineapple vers
apple", "milk with apple", "apple milk shake" ...

Is there any other way to overcome this problem?

Thanks,

Johnny


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2329370.html
Sent from the Solr - User mailing list archive at Nabble.com.
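
The behaviour reported above follows from how the edgytext chain is set up:
the index side splits on whitespace and builds edge n-grams per word, so any
document containing a word that starts with the query matches. The Python
sketch below models that under simplifying assumptions (plain whitespace
splitting and lowercasing, hypothetical helper names); it is not Solr code.

```python
def edge_ngrams(token, min_gram=3, max_gram=25):
    """Front-side edge n-grams of one token (sizes as in the edgytext type)."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_terms(text):
    """Index side: whitespace tokens, lowercased, expanded per token."""
    terms = set()
    for token in text.lower().split():
        terms.update(edge_ngrams(token))
    return terms


def matches(query, doc):
    """Query side: the whole query is kept as one lowercased term."""
    return query.lower() in index_terms(doc)


docs = ["pineapple vers apple", "milk with apple", "apple milk shake"]
print([d for d in docs if matches("apple", d)])
```

All three documents contain the word apple, so all three are returned; the
position of the word inside the document plays no role in this model.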


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Erick Erickson
Let's back up here because now I'm not clear what you actually want.
EdgeNGrams are a way of matching substrings, which is what's happening here.
Of course searching for apple matches all three examples, just as searching
for apple without grams would; that's the expected behavior.

So, we need a clear problem definition of what you're trying to do, along
with example queries (please post the results of adding &debugQuery=on).

Best
Erick

On Tue, Jan 25, 2011 at 8:29 AM, johnnyisrael johnnyi.john...@gmail.com wrote:


 Hi Eric,

 You are right, there is a copy field to EdgeNgram, I tried the
 configuration
 but it not working as expected.

 Configuration I tried:

 

 fieldType name=”query” class=”solr.TextField” positionIncrementGap=”100″
 termVectors=”true”
 analyzer type=”index”
 tokenizer class=”solr.StandardTokenizerFactory”/
 filter class=”solr.LowerCaseFilterFactory”/
 /analyzer
 analyzer type=”query”
 tokenizer class=”solr.StandardTokenizerFactory”/
 filter class=”solr.LowerCaseFilterFactory”/
 /analyzer
 /fieldType

 fieldType name=”edgytext” class=”solr.TextField”
 positionIncrementGap=”100″
 analyzer type=”index”
 tokenizer class=”solr.WhitespaceTokenizerFactory”/
 filter class=”solr.LowerCaseFilterFactory”/
 filter class=”solr.EdgeNGramFilterFactory” minGramSize=”3″
 maxGramSize=”25″/
 /analyzer
 analyzer type=”query”
 tokenizer class=”solr.KeywordTokenizerFactory”/
 filter class=”solr.LowerCaseFilterFactory”/
 /analyzer
 /fieldType

 field name=”user_query” type=”query” indexed=”true” stored=”true”
 omitNorms=”true” omitTermFreqAndPositions=”true” /
 field name=”edgy_user_query” type=”edgytext” indexed=”true” stored=”true”
 omitNorms=”true” omitTermFreqAndPositions=”true” /

 defaultSearchFieldedgy_user_query/defaultSearchField
 copyField source=”user_query” dest=”edgy_user_query”/

 ==

 When I search for the term apple.

 It is returning results for pineapple vers apple, milk with apple,
 apple milk shake ...

 Is there any other way to overcome this problem?

 Thanks,

 Johnny


 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2329370.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread johnnyisrael

Hi Eric,

What I want here is, let's say I have 3 documents like

[pineapple vers apple, milk with apple, apple milk shake]

and if I search for apple, it should return only "apple milk shake",
because that document alone starts with the string apple that I typed in. It
should not return the others, and if I type milk it should return only "milk
with apple".

I want output similar to Google auto-suggest.

Is there a way to achieve this without enclosing the query in double quotes?

Thanks,

Johnny
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2333602.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Jonathan Rochkind
I haven't figured out any way to achieve that AT ALL without making a 
separate Solr index just to serve autosuggest queries. At least when you 
want to auto-suggest on a multi-valued field. Someone posted a crazy 
tricky way to do it with a single-valued field a while ago.  If you 
can/are willing to make a separate Solr index with a schema set up for 
auto-suggest specifically, it's easy. But from an existing schema, where 
you want to auto-suggest just based on the values in one field, it's a 
multi-valued field, and you want to allow matches in the middle of the 
field -- I don't think there's a way to do it.


On 1/25/2011 3:03 PM, johnnyisrael wrote:

Hi Eric,

What I want here is, lets say I have 3 documents like

[pineapple vers apple, milk with apple, apple milk shake ]

and If i search for apple, it should return only apple milk shake
because that term alone starts with the letter apple which I typed in. It
should not bring others and if I type milk it should return only milk
with apple

I want an output Similar like a Google auto suggest.

Is there a way to achieve  this without encapsulating with double quotes.

Thanks,

Johnny


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Markus Jelsma
Then you don't need NGrams at all. A wildcard will suffice or you can use the 
TermsComponent.

If these strings are indexed as single tokens (KeywordTokenizer with 
LowercaseFilter) you can simply do field:app* to retrieve the apple milk 
shake. You can also use the string field type but then you must make sure the 
values are already lowercased before indexing.

Be careful though, there is no query time analysis for wildcard (and fuzzy) 
queries so make sure 

 Hi Eric,
 
 What I want here is, lets say I have 3 documents like
 
 [pineapple vers apple, milk with apple, apple milk shake ]
 
 and If i search for apple, it should return only apple milk shake
 because that term alone starts with the letter apple which I typed in. It
 should not bring others and if I type milk it should return only milk
 with apple
 
 I want an output Similar like a Google auto suggest.
 
 Is there a way to achieve  this without encapsulating with double quotes.
 
 Thanks,
 
 Johnny


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Markus Jelsma
Oh, i should perhaps mention that EdgeNGrams will yield results a lot quicker 
than using wildcards at the cost of a larger index. You should, of course, use 
EdgeNGrams if you worry about performance and have a huge index and a number 
of queries per second.

 Then you don't need NGrams at all. A wildcard will suffice or you can use
 the TermsComponent.
 
 If these strings are indexed as single tokens (KeywordTokenizer with
 LowercaseFilter) you can simply do field:app* to retrieve the apple milk
 shake. You can also use the string field type but then you must make sure
 the values are already lowercased before indexing.
 
 Be careful though, there is no query time analysis for wildcard (and fuzzy)
 queries so make sure
 
  Hi Eric,
  
  What I want here is, lets say I have 3 documents like
  
  [pineapple vers apple, milk with apple, apple milk shake ]
  
  and If i search for apple, it should return only apple milk shake
  because that term alone starts with the letter apple which I typed in.
  It should not bring others and if I type milk it should return only
  milk with apple
  
  I want an output Similar like a Google auto suggest.
  
  Is there a way to achieve  this without encapsulating with double quotes.
  
  Thanks,
  
  Johnny


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread mesenthil

The index contains around 1.5 million documents. As this is used for
autosuggest feature, performance is an important factor. 

So it looks like it is difficult to achieve the following using edgeNgram:

The result should contain only those terms where the search letters match
the start of the first word. For example, when we type M, it should return
Mumford and Sons and not jackson Michael.


Jonathan,

Is it possible to achieve this when we have separate index using edgeNgram?
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2334538.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Jonathan Rochkind
Ah, sorry, I got confused about your requirements, if you just want to 
match at the beginning of the field, it may be more possible.  Using 
edgegrams or wildcard. If you have a single-valued field. Do you have a 
single-valued or a multi-valued field?  That is, does each document have 
just one value, or multiple?   I still get confused about how to do it 
with edgegrams, even with single-valued field, but I think maybe it's 
possible.


_Definitely_ possible, with or without edgegrams, if you are 
willing/able to make a completely separate Solr index where each term 
for auto-suggest is a document.  Yes.


The problem lies in what the results are. In general, Solr's results are 
the documents you have in the Solr index. Thus it makes everything a lot 
easier to deal with if you have an index where each document in the 
index is a term for auto-suggest.   But that doesn't always meet 
requirements if you need to auto-suggest within existing fq's and such, 
and of course it takes more resources to run an additional Solr index.


On 1/25/2011 5:03 PM, mesenthil wrote:

The index contains around 1.5 million documents. As this is used for
autosuggest feature, performance is an important factor.

So it looks like, using edgeNgram it is difficult to achieve the the
following

Result should return only those terms where search letter is matching with
the first word only. For example, when we type M,  it should return
Mumford and Sons and not jackson Michael.


Jonathan,

Is it possible to achieve this when we have separate index using edgeNgram?



Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread mesenthil

Right now our configuration says multiValued=true. But that need not be
true in our case. I will make it false, try again, and update this thread with
more details.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2334627.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Erick Erickson
OK, try this.

Use some analysis chain for your field like:

<analyzer>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory" />
</analyzer>

This can be a multiValued field, BTW.

now use the TermsComponent to fetch your data. See:
http://wiki.apache.org/solr/TermsComponent

and specify terms.prefix=apple e.g.
http://localhost:8983/solr/terms?terms.prefix=app&terms.fl=blivet

The return list should be what you want. Note that the returned
values will be lower cased, and you can only specify
lower case in your search term (all because of specifying
the lowercase filter in my example).

This should be very fast no matter what your index size, as the
return list size defaults to 10 (though you can specify different
numbers).

Best
Erick

On Tue, Jan 25, 2011 at 3:03 PM, johnnyisrael johnnyi.john...@gmail.com wrote:


 Hi Eric,

 What I want here is, lets say I have 3 documents like

 [pineapple vers apple, milk with apple, apple milk shake ]

 and If i search for apple, it should return only apple milk shake
 because that term alone starts with the letter apple which I typed in. It
 should not bring others and if I type milk it should return only milk
 with apple

 I want an output Similar like a Google auto suggest.

 Is there a way to achieve  this without encapsulating with double quotes.

 Thanks,

 Johnny
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2333602.html
 Sent from the Solr - User mailing list archive at Nabble.com.
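
The prefix approach suggested in this message can be pictured with a small
Python model. This is a sketch only: prefix_suggest is a hypothetical
function, each value is treated as one lowercased term (as a keyword
tokenizer plus lowercase filter would do), and results are returned in
alphabetical order with a limit of 10 as a simplification.

```python
def prefix_suggest(values, prefix, limit=10):
    """Return the lowercased whole-value terms that start with the prefix."""
    wanted = prefix.lower()
    # One term per field value: no splitting on whitespace.
    terms = sorted({v.strip().lower() for v in values})
    return [t for t in terms if t.startswith(wanted)][:limit]


docs = ["pineapple vers apple", "milk with apple", "apple milk shake"]
print(prefix_suggest(docs, "app"))   # ['apple milk shake']
print(prefix_suggest(docs, "milk"))  # ['milk with apple']
```

Because each value stays a single term, only values that begin with the typed
text are suggested, which is the behaviour asked for earlier in the thread.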



Re: Query performance issue while using EdgeNGram

2010-12-22 Thread Shanmugavel SRD

1) Thanks for this update. I have to use 'WhiteSpaceTokenizer'
2) I have to suggest the whole query itself (Say name or title)
3) Could you please let me know if there is a way to find the evicted docs?
4) Yes, we are seeing an improvement in the response time if we optimize. But
for some queries QTime is still more than 8 secs. It is a 'Blocker' for us.
Could you please suggest a way to reduce the QTime to 1 sec?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-performance-issue-while-using-EdgeNGram-tp2097056p2130751.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Query performance issue while using EdgeNGram

2010-12-22 Thread Erick Erickson
Hmmm. find evicted docs? If you mean find out how many docs are deleted,
look on the admin schema browser page and the difference between MaxDoc and
NumDocs is the number of deleted documents.

You say for some queries the QTime is more than 8 secs. What happens if you
re-run that query a bit later? The reason I ask is if you're not warming the
cache that that particular query uses, you may be seeing cache loading time
here.

Look at the admin stats page, especially for evictions. It's also possible
that your caches are being reclaimed for some queries and you're seeing
response time spikes when the caches are re-loaded.

Best
Erick

On Wed, Dec 22, 2010 at 7:10 AM, Shanmugavel SRD
srdshanmuga...@gmail.com wrote:


 1) Thanks for this update. I have to use 'WhiteSpaceTokenizer'
 2) I have to suggest the whole query itself (Say name or title)
 3) Could you please let me know if there is a way to find the evicted docs?
 4) Yes, we are seeing improvement in the response time if we optimize. But
 still for some queries QTime is more than 8 secs. It is a 'Blocker' for us.
 Could you please suggest any to reduce the QTime to 1 secs.
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Query-performance-issue-while-using-EdgeNGram-tp2097056p2130751.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Query performance issue while using EdgeNGram

2010-12-16 Thread Erick Erickson
})(.*)? replacement=$1 replace=all /
/analyzer
  /fieldType
 /types
 fields
   field name=name type=string indexed=false stored=true/
   field name=id type=string indexed=true stored=true /
   field name=score type=sfloat indexed=true stored=false /
   field name=autosuggest type=typeahead indexed=true
 stored=false/
 /fields
 uniqueKeyid/uniqueKey
 defaultSearchFieldautosuggest/defaultSearchField
 copyField source=name dest=autosuggest/
 solrQueryParser defaultOperator=AND/
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Query-performance-issue-while-using-EdgeNGram-tp2097056p2097056.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: EdgeNGram relevancy

2010-11-16 Thread Robert Gründler
thanks for the explanation.

the results for the autocompletion are pretty good now, but we still have a
small problem.

When there are hits in the edgytext2 fields, results which only have hits in
the edgytext field should not be returned at all.

Example:

Query: Martin Sco

Current Results (in that order):

- Martin Scorsese
- Martin Lawrence
- Joseph Martin

However, in an autocompletion context, only Martin Scorsese makes sense, the
2 others are logically not correct.

I'm not sure if this can be solved on the solr side, or if we should implement
the logic in the application.


thanks!

-robert







On Nov 12, 2010, at 12:13 AM, Jonathan Rochkind wrote:

 Without the parens, the edgytext: only applied to Mr; the default field 
 still applied to Scorsese.
 
 The double quotes are necessary in the second case (rather than parens) 
 because, on a non-tokenized field, the standard query parser would otherwise 
 pre-tokenize on whitespace before sending the individual whitespace-separated 
 words to match against the index. If the index includes multi-word tokens with 
 internal whitespace, they will never match. With a quoted phrase, the standard 
 query parser doesn't pre-tokenize like this; it passes the whole phrase to the 
 index intact.
 
 Robert Gründler wrote:
 Did you run your query without using () and  operators? If yes can you 
 try this?
 q=edgytext:(Mr Scorsese) OR edgytext2:Mr Scorsese^2.0

 
 I didn't use () and  in my query before. Using the query with those 
 operators
 works now, stopwords are thrown out as the should, thanks.
 
 However, i don't understand how the () and  operators affect the 
 StopWordFilter.
 
 Could you give a brief explanation for the above example?
 
 thanks!
 
 
 -robert
 
 
 
 
 
  



Re: EdgeNGram relevancy

2010-11-16 Thread Robert Gründler
it seems adding the '+' (required) operator to each term in a multi-term query 
does the trick:

http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#+

ie: edgytext2:(+Martin +Sco)


-robert
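
As a small illustration of the query form above, the following Python sketch
builds a field query in which every term is marked as required with +.
required_terms_query is a hypothetical helper; it splits on whitespace only
and does not escape query-parser special characters, which a real application
would need to handle.

```python
def required_terms_query(field, user_input):
    """Prefix every whitespace-separated term with '+' (required)."""
    terms = user_input.split()
    return "{}:({})".format(field, " ".join("+" + t for t in terms))


print(required_terms_query("edgytext2", "Martin Sco"))  # edgytext2:(+Martin +Sco)
```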



On Nov 16, 2010, at 8:52 PM, Robert Gründler wrote:

 thanks for the explanation.
 
 the results for the autocompletion are pretty good now, but we still have a 
 small problem. 
 
 When there are hits in the edgytext2 fields, results which only have hits 
 in the edgytext field
 should not be returned at all.
 
 Example:
 
 Query: Martin Sco
 
 Current Results (in that order):
 
 - Martin Scorsese
 - Martin Lawrence
 - Joseph Martin
 
 However, in an autocompletion context, only Martin Scorsese makes sense, 
 the 2 others are logically
 not correct.
 
 I'm not sure if this can be solved on the solr side, or if we should 
 implement the logic in the
 application.
 
 
 thanks!
 
 -robert
 
 
 
 
 
 
 
 On Nov 12, 2010, at 12:13 AM, Jonathan Rochkind wrote:
 
 Without the parens, the edgytext: only applied to Mr; the default field 
 still applied to Scorsese.
 
 The double quotes are necessary in the second case (rather than parens) 
 because, on a non-tokenized field, the standard query parser would otherwise 
 pre-tokenize on whitespace before sending the individual whitespace-separated 
 words to match against the index. If the index includes multi-word tokens with 
 internal whitespace, they will never match. With a quoted phrase, the standard 
 query parser doesn't pre-tokenize like this; it passes the whole phrase to the 
 index intact.
 
 Robert Gründler wrote:
 Did you run your query without using () and  operators? If yes can you 
 try this?
 q=edgytext:(Mr Scorsese) OR edgytext2:Mr Scorsese^2.0
 
 
 I didn't use () and  in my query before. Using the query with those 
 operators
 works now, stopwords are thrown out as the should, thanks.
 
 However, i don't understand how the () and  operators affect the 
 StopWordFilter.
 
 Could you give a brief explanation for the above example?
 
 thanks!
 
 
 -robert
 
 
 
 
 
 
 



EdgeNGram relevancy

2010-11-11 Thread Robert Gründler
Hi,

consider the following fieldtype (used for autocompletion):

  <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="([^a-z])" replacement="" replace="all" />
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
             maxGramSize="25" />
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="([^a-z])" replacement="" replace="all" />
   </analyzer>
  </fieldType>


This works fine as long as the query string is a single word. For multiple 
words, the ranking is weird though.

Example:

Query String: Bill Cl

Result (in that order):

- Clyde Phillips
- Clay Rogers
- Roger Cloud
- Bill Clinton

Bill Clinton should have the highest rank in that case.  

Does anyone have an idea how to configure this fieldtype to make matches in both 
tokens rank higher than those that match in either token?


thanks!


-robert





Re: EdgeNGram relevancy

2010-11-11 Thread Ahmet Arslan
You can add an additional field, using KeywordTokenizerFactory instead of 
WhitespaceTokenizerFactory, and query both fields with an OR operator:

edgytext:(Bill Cl) OR edgytext2:"Bill Cl"

You can even apply a boost so that "begins with" matches come first.
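
A toy Python model of this two-field idea is sketched below. It is not how
Lucene scores documents: the score is just a count of matching query words
plus a fixed bonus when the whole query is a prefix of the whole name. The
function names and the boost value of 2.0 are assumptions for illustration,
and the stopword and pattern-replace filters are left out.

```python
def edge_ngrams(token, min_gram=1, max_gram=25):
    """Front-side edge n-grams of one token."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def word_grams(name):
    """edgytext-like field: grams of each whitespace-separated word."""
    grams = set()
    for token in name.lower().split():
        grams.update(edge_ngrams(token))
    return grams


def phrase_grams(name):
    """edgytext2-like field: grams of the whole value as a single token."""
    return set(edge_ngrams(name.lower()))


def score(query, name, phrase_boost=2.0):
    """Count word-level hits and add a bonus for a whole-value prefix hit."""
    q = query.lower()
    word_hits = sum(1 for t in q.split() if t in word_grams(name))
    phrase_hit = phrase_boost if q in phrase_grams(name) else 0.0
    return word_hits + phrase_hit


names = ["Clyde Phillips", "Clay Rogers", "Roger Cloud", "Bill Clinton"]
ranked = sorted(names, key=lambda n: score("Bill Cl", n), reverse=True)
print(ranked[0])  # Bill Clinton
```

In this model only Bill Clinton gets the whole-value bonus for the query
Bill Cl, so it moves to the top, while the other names keep a single
word-level hit for cl.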






Re: EdgeNGram relevancy

2010-11-11 Thread Robert Gründler
thanks a lot, that setup works pretty well now.

the only problem now is that the stop words no longer work that well. I'll 
provide an example, but first the 2 fieldtypes:

  <!-- autocomplete field which finds matches inside strings ("scor" matches 
  "Martin Scorsese") -->

  <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="([^a-z])" replacement="" replace="all"/>
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
             maxGramSize="25"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="([^a-z])" replacement="" replace="all"/>
   </analyzer>
  </fieldType>

  <!-- autocomplete field which finds startsWith matches only ("scor" matches 
  only "Scorpio", but not "Martin Scorsese") -->

  <fieldType name="edgytext2" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="([^a-z])" replacement="" replace="all"/>
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
             maxGramSize="25"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="([^a-z])" replacement="" replace="all"/>
   </analyzer>
  </fieldType>


This setup now causes trouble with stop words. Here's an example:

Let's say the index contains 2 strings: "Mr Martin Scorsese" and "Martin 
Scorsese". "Mr" is in the stopword list.

Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0

This way, the only result I get is "Mr Martin Scorsese", because the strict 
field edgytext2 is boosted by 2.0. 

Any idea why "Martin Scorsese" is not in the result at all in this case?


thanks again!


-robert









Re: EdgeNGram relevancy

2010-11-11 Thread Ahmet Arslan
 This setup now makes troubles regarding StopWords, here's
 an example:
 
 Let's say the index contains 2 Strings: Mr Martin
 Scorsese and Martin Scorsese. Mr is in the stopword
 list.
 
 Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0
 
 This way, the only result i get is Mr Martin Scorsese,
 because the strict field edgytext2 is boosted by 2.0. 
 
 Any idea why in this case Martin Scorsese is not in the
 result at all?

Did you run your query without using the () and "" operators? If yes, can you try 
this?
q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0

If not, can you paste the output of debugQuery=on?


  


Re: EdgeNGram relevancy

2010-11-11 Thread Nick Martin


This would still not deal with the problem of removing stop words from the 
indexing and query analysis stages.

I really need something that will allow that and give a single token as in the 
example below.

Best

Nick

Re: EdgeNGram relevancy

2010-11-11 Thread Andy
Could anyone help me understand why "Clyde Phillips" appears in the 
results for "Bill Cl"?

"Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so 
why is it even in the results?

Thanks.






Re: EdgeNGram relevancy

2010-11-11 Thread Robert Gründler
According to the fieldtype I posted previously, I think it's because of:

1. WhitespaceTokenizer splits the string "Clyde Phillips" into 2 tokens: 
"Clyde" and "Phillips".
2. EdgeNGramFilter gets the 2 tokens and creates edge n-grams for each token: 
"C" "Cl" "Cly" ...   AND  "P" "Ph" "Phi" ...

The query string "Bill Cl" gets split up into 2 tokens, "Bill" and "Cl", by the 
WhitespaceTokenizer.

This creates a match between the 2nd token of the query, "Cl", and one of the 
subtokens the EdgeNGramFilter created: "Cl".
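
The steps above can be sketched in a few lines of Python. This is only a toy model of the analysis chain (the helper names are hypothetical, and lowercasing stands in for the LowerCaseFilter); it is not how Solr is implemented, but it shows which query token produces the hit.

```python
def edge_ngrams(token, min_gram=1, max_gram=25):
    """Return the leading-edge n-grams of a token, e.g. 'cly' -> ['c', 'cl', 'cly']."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

def index_terms(text):
    """Index side: whitespace-split, lowercase, then edge n-gram every token."""
    terms = set()
    for word in text.lower().split():
        terms.update(edge_ngrams(word))
    return terms

def query_terms(text):
    """Query side: whitespace-split and lowercase only (no n-grams)."""
    return text.lower().split()

indexed = index_terms("Clyde Phillips")
for term in query_terms("Bill Cl"):
    # Reports, per query token, whether it equals one of the indexed n-grams.
    print(term, term in indexed)
```

With the default OR operator, a single matching query token is enough for the document to appear in the results.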


-robert







Re: EdgeNGram relevancy

2010-11-11 Thread Andy
Ah I see. Thanks for the explanation.

Could you set the defaultOperator to AND? That way both "Bill" and "Cl" must 
match, and that would exclude "Clyde Phillips".
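
As a rough illustration of the difference, the following Python sketch compares OR and AND matching over the four names from the original example. It is a toy model with hypothetical helper names, not Solr's query parser, and it ignores scoring entirely.

```python
NAMES = ["Clyde Phillips", "Clay Rogers", "Roger Cloud", "Bill Clinton"]

def edge_ngrams(token, min_gram=1, max_gram=25):
    """Return the leading-edge n-grams of a token."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

def index_terms(text):
    """Index side: whitespace-split, lowercase, then edge n-gram every token."""
    terms = set()
    for word in text.lower().split():
        terms.update(edge_ngrams(word))
    return terms

def search(query, operator="OR"):
    """Return the names matching any (OR) or all (AND) of the query tokens."""
    tokens = query.lower().split()
    combine = all if operator == "AND" else any
    return [name for name in NAMES
            if combine(token in index_terms(name) for token in tokens)]

print(search("Bill Cl", "OR"))
print(search("Bill Cl", "AND"))
```

In this model OR returns all four names, while AND leaves only "Bill Clinton", because it is the only entry with a token starting with "bill" and a token starting with "cl".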







Re: EdgeNGram relevancy

2010-11-11 Thread Robert Gründler
 
 Did you run your query without using the () and "" operators? If yes, can you try 
 this?
 q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0

I didn't use () and "" in my query before. Using the query with those operators
works now; stop words are thrown out as they should be, thanks.

However, I don't understand how the () and "" operators affect the 
StopFilter.

Could you give a brief explanation for the above example?

thanks!


-robert






Re: EdgeNGram relevancy

2010-11-11 Thread Jonathan Rochkind
Without the parens, the "edgytext:" only applied to "Mr"; the default 
field still applied to "Scorsese".


The double quotes are necessary in the second case (rather than parens) 
because, on a non-tokenized field, the standard query parser will 
pre-tokenize on whitespace before sending the individual 
whitespace-separated words to match the index. If the index includes multi-word 
tokens with internal whitespace, they will never match. But the standard 
query parser doesn't pre-tokenize a double-quoted phrase like this; it passes 
the whole phrase to the index intact.
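
A small Python sketch can make the field binding visible. This is a deliberately simplified toy, not Lucene's query parser: the function `bind_fields` and the default field name `text` are assumptions for illustration, and boosts are simply dropped.

```python
import re

# A clause is an optional 'field:' prefix followed by a (group), a "phrase"
# or a bare word, with an optional ^boost that this toy model ignores.
CLAUSE = re.compile(r'(?:(\w+):)?(\([^)]*\)|"[^"]*"|[^\s()"^]+)(?:\^[\d.]+)?')

def bind_fields(query, default_field="text"):
    """Return (field, text) pairs showing which field each clause is sent to."""
    clauses = []
    for field, body in CLAUSE.findall(query):
        field = field or default_field
        if body.startswith("("):
            # Parentheses: each word inside is bound to the field separately.
            clauses += [(field, word) for word in body[1:-1].split()]
        elif body.startswith('"'):
            # Double quotes: the phrase reaches the field in one piece.
            clauses.append((field, body[1:-1]))
        elif body not in ("OR", "AND"):
            # Bare word: bound to its prefix, or to the default field if none.
            clauses.append((field, body))
    return clauses

print(bind_fields('edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0'))
print(bind_fields('edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0'))
```

In the first query "Scorsese" falls through to the default field both times, so neither edgytext nor edgytext2 ever sees it. In the second query edgytext receives both words and edgytext2 receives the whole phrase as one unit.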

