Solr 5: hit highlight with NGram/EdgeNgram-fields
With Solr 4.10.3 I was advised to set luceneMatchVersion to 4.3 to make hit highlighting work with NGram/EdgeNGram fields, like this:

    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="1" luceneMatchVersion="4.3"/>

In Solr 5 and 5.1 this no longer seems to work. The complete word is highlighted, not just the part that matches the search term. The Solr admin analysis page again does not show the proper end-offset positions. What it shows is this:

    LENGTF
    text            t     te       tes         test
    raw_bytes       [74]  [74 65]  [74 65 73]  [74 65 73 74]
    start           0     0        0           0
    end             4     4        4           4
    positionLength  1     1        1           1
    type            word  word     word        word
    position        1     1        1           1

In Solr 4.10.3 with luceneMatchVersion set to 4.3, the end offsets would be 1, 2, 3, 4 and hit highlighting would work.

Any advice on making hit highlighting work with (Edge)NGram fields would be highly appreciated!

Thanks,
Bjørn
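The effect of the two sets of end offsets can be illustrated with a small Python sketch. This is not Solr's highlighter or the real filter; `edge_ngrams`, `highlight`, and the `per_gram_offsets` flag are hypothetical names used only to show why an end offset of 4 on every gram highlights the whole word, while end offsets of 1, 2, 3, 4 highlight just the matched prefix.

```python
def edge_ngrams(token, start, min_gram=1, max_gram=20, per_gram_offsets=True):
    """Return (gram, start_offset, end_offset) for each front edge n-gram.

    per_gram_offsets=True  models end offsets 1, 2, 3, 4 (one per gram).
    per_gram_offsets=False models every gram carrying the full token's end offset.
    """
    grams = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        end = start + size if per_gram_offsets else start + len(token)
        grams.append((token[:size], start, end))
    return grams


def highlight(text, grams, query):
    """Wrap the span of the first gram equal to the query in <em> tags."""
    for gram, start, end in grams:
        if gram == query:
            return text[:start] + "<em>" + text[start:end] + "</em>" + text[end:]
    return text


per_gram = edge_ngrams("test", 0, per_gram_offsets=True)
whole_token = edge_ngrams("test", 0, per_gram_offsets=False)

print(highlight("test", per_gram, "te"))     # only the matched prefix is wrapped
print(highlight("test", whole_token, "te"))  # the complete word is wrapped
```

A highlighter can only mark the character range that the matched term reports, so the offsets shown on the analysis page are a quick way to predict what will be highlighted.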
Combination of edgengram and ngram
I am interested in a new filter type, one that would combine edgengram and ngram. The idea is that it would create all ngrams specified by the min/max size, but the ngrams that happen to be edgengrams (specifically the left side) would get an index-time boost. Optionally the boost would be higher if it came from the first token.

The use case: An automatic autosuggest dropdown that populates as a user types into a search box. The index would have one field and it would be built from a manually produced list of suggested search phrases. The boosts mentioned would make it so that matches from the beginning of a word, and especially from the beginning of the entire suggested phrase, would be returned first.

I could get a similar effect by using a copyfield, analyzing one field with ngrams and the other with edgengrams, then using edismax to put a boost on the edge version. I will start with this method, but using copyfield makes the index bigger, and using dismax makes the ultimate parsed queries more complicated. If I can avoid the copyfield, the index will be smaller and the queries very simple, which should make for very high speed.

I will take a look at the source code, but I'm a bit of a Java novice. Does anyone have the knowledge, desire, and time to crank this one out quickly? Is it possible someone has already written such a filter?

Thanks,
Shawn
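As a rough illustration of the behaviour being asked for, here is a minimal Python sketch. It is not an existing Solr or Lucene filter; the function name `boosted_ngrams`, the whitespace tokenization, and the boost values are all assumptions made for the example, and how such boosts would actually be stored in an index is left open.

```python
def boosted_ngrams(phrase, min_gram=2, max_gram=4,
                   edge_boost=2.0, first_token_boost=3.0):
    """Return (gram, boost) pairs for every n-gram of every token.

    Grams that start at the left edge of a token get edge_boost; if the
    token is the first one in the phrase they get first_token_boost.
    All other grams get a neutral boost of 1.0.
    """
    results = []
    for token_index, token in enumerate(phrase.lower().split()):
        for start in range(len(token)):
            for size in range(min_gram, max_gram + 1):
                if start + size > len(token):
                    break  # gram would run past the end of the token
                gram = token[start:start + size]
                if start != 0:
                    boost = 1.0
                elif token_index == 0:
                    boost = first_token_boost
                else:
                    boost = edge_boost
                results.append((gram, boost))
    return results


for gram, boost in boosted_ngrams("solr search", min_gram=2, max_gram=3):
    print(gram, boost)
```

With this shape a single field carries both kinds of grams, which is the point of avoiding the copyfield; ranking by the attached boost would then favour matches at the start of a word and, most of all, at the start of the phrase.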
Problem using EdgeNGram
Hi,

I am using Solr 3.3 with SolrJ. I am trying to use EdgeNGram to power an auto-suggest feature in my application. My understanding is that using EdgeNGram means results will only be returned for records starting with the search criteria, but this is not happening for me. For example, if I search for tr, I get the following results:

    Greenham Trading 6 IT Training Publications AA Training

Below are the details of my configuration:

    <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="businessName" type="edgytext" indexed="true" stored="true" required="true" omitNorms="true" omitTermFreqAndPositions="true"/>

Any ideas why this is happening would be much appreciated.

Thanks.
Re: Problem using EdgeNGram
Try using KeywordTokenizerFactory instead of StandardTokenizerFactory to get the results you want.
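Why the tokenizer matters here can be sketched in Python. This is only an approximation of the analysis chain, not Solr code: the regular expression is a rough stand-in for StandardTokenizerFactory, and `edge_ngrams`, `index_terms`, and `matches` are hypothetical helper names. A word-splitting tokenizer produces edge n-grams for every word, so tr matches any value containing a word that starts with tr; a keyword tokenizer produces edge n-grams only for the start of the whole value.

```python
import re


def edge_ngrams(token, min_gram=1, max_gram=15):
    """Front edge n-grams of a single token."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


def index_terms(value, tokenizer):
    """Terms that would be indexed for a field value (lowercased, then n-grammed)."""
    if tokenizer == "standard":
        tokens = re.findall(r"\w+", value.lower())  # rough stand-in: split into words
    elif tokenizer == "keyword":
        tokens = [value.lower()]                    # whole value is one token
    else:
        raise ValueError("unknown tokenizer: " + tokenizer)
    terms = set()
    for token in tokens:
        terms.update(edge_ngrams(token))
    return terms


def matches(query, value, tokenizer):
    """True if the lowercased query is one of the indexed terms."""
    return query.lower() in index_terms(value, tokenizer)


print(matches("tr", "AA Training", "standard"))  # word-level grams include "tr"
print(matches("tr", "AA Training", "keyword"))   # value-level grams start with "a"
print(matches("aa t", "AA Training", "keyword"))
```

Note that in this sketch the query is compared as a single term, which corresponds to also using a keyword-style tokenizer on the query side; with maxGramSize=15, prefixes longer than 15 characters would not match.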
Re: Edgengram
Hi Tomás,

Thank you very much for your suggestion. I took another crack at it using your recommendation and it worked ideally. The only thing I had to change was

    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>

to

    <analyzer type="query">
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    </analyzer>

The first did not produce any results but the second worked beautifully.

Thanks!
Brian Lamb

2011/5/31 Tomás Fernández Löbbe tomasflo...@gmail.com

...or also use the LowerCaseTokenizerFactory at query time for consistency, but not the edge ngram filter.
Re: Edgengram
Be a little careful here. LowerCaseTokenizerFactory is different from KeywordTokenizerFactory.

LowerCaseTokenizerFactory will give you more than one term. E.g. the string "Intelligence can't be MeaSurEd" will give you 5 terms, any of which may match: intelligence, can, t, be, measured. KeywordTokenizerFactory followed by, say, LowerCaseFilter would give you exactly one token: "intelligence can't be measured".

So searching for measured would get a hit in the first case but not in the second. Searching for intellig* would hit both.

Neither is better, just make sure they do what you want! This page will help a lot:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseTokenizerFactory
as will the admin/analysis page.

Best
Erick

On Wed, Jun 1, 2011 at 10:43 AM, Brian Lamb brian.l...@journalexperts.com wrote:

Hi Tomás,

Thank you very much for your suggestion. I took another crack at it using your recommendation and it worked ideally. The only thing I had to change was

    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>

to

    <analyzer type="query">
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    </analyzer>

The first did not produce any results but the second worked beautifully.

Thanks!
Brian Lamb
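The difference between the two tokenizers described above can be reproduced with a short Python sketch. These functions only imitate the behaviour described in the message (split on non-letters and lowercase, versus keep the whole string as one lowercased token); they are not the Solr classes, and the function names are made up for the example.

```python
import re


def lower_case_tokenizer(text):
    """Imitates LowerCaseTokenizerFactory: split on non-letters, lowercase each piece."""
    return [piece.lower() for piece in re.findall(r"[^\W\d_]+", text)]


def keyword_then_lowercase(text):
    """Imitates KeywordTokenizerFactory followed by a lowercase filter: one token."""
    return [text.lower()]


def has_prefix(tokens, prefix):
    """Mimics a prefix query such as intellig* against a list of tokens."""
    return any(token.startswith(prefix) for token in tokens)


text = "Intelligence can't be MeaSurEd"
many = lower_case_tokenizer(text)
one = keyword_then_lowercase(text)

print(many)
print(one)
print("measured" in many, "measured" in one)
print(has_prefix(many, "intellig"), has_prefix(one, "intellig"))
```

For a field that never contains spaces or punctuation, as in this thread, both approaches yield a single token and differ only in whether non-letter characters would split the value.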
Re: Edgengram
I think in my case LowerCaseTokenizerFactory will be sufficient because there will never be spaces in this particular field. But thank you for the useful link!

Thanks,
Brian Lamb

On Wed, Jun 1, 2011 at 11:44 AM, Erick Erickson erickerick...@gmail.com wrote:

Be a little careful here. LowerCaseTokenizerFactory is different from KeywordTokenizerFactory.

LowerCaseTokenizerFactory will give you more than one term. E.g. the string "Intelligence can't be MeaSurEd" will give you 5 terms, any of which may match: intelligence, can, t, be, measured. KeywordTokenizerFactory followed by, say, LowerCaseFilter would give you exactly one token: "intelligence can't be measured".

So searching for measured would get a hit in the first case but not in the second. Searching for intellig* would hit both.

Neither is better, just make sure they do what you want! This page will help a lot:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseTokenizerFactory
as will the admin/analysis page.

Best
Erick
Re: Edgengram
That'll work for your case, although be aware that string types aren't analyzed at all, so case matters, as do spaces etc.

What is the use-case here? If you explain it a bit there might be better answers.

Best
Erick

On Fri, May 27, 2011 at 9:17 AM, Brian Lamb brian.l...@journalexperts.com wrote:

For this, I ended up just changing it to string and using abcdefg* to match. That seems to work so far.

Thanks,
Brian Lamb
Re: Edgengram
In this particular case, I will be doing a Solr search based on user preferences. So I will not be depending on the user to type abcdefg. That will be automatically generated based on user selections. The contents of the field do not contain spaces, and since I am creating the search parameters, case isn't important either.

Thanks,
Brian Lamb

On Tue, May 31, 2011 at 9:44 AM, Erick Erickson erickerick...@gmail.com wrote:

That'll work for your case, although be aware that string types aren't analyzed at all, so case matters, as do spaces etc.

What is the use-case here? If you explain it a bit there might be better answers.

Best
Erick
Re: Edgengram
Can you specify the analyzer you are using for your queries? Maybe you could use a KeywordAnalyzer for your queries so you don't end up matching parts of your query.

http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

This should help you.

On Tue, May 31, 2011 at 8:24 PM, Brian Lamb brian.l...@journalexperts.com wrote:

In this particular case, I will be doing a Solr search based on user preferences. So I will not be depending on the user to type abcdefg. That will be automatically generated based on user selections. The contents of the field do not contain spaces, and since I am creating the search parameters, case isn't important either.

Thanks,
Brian Lamb

--
Thanks and Regards,
DakshinaMurthy BM
Re: Edgengram
    <fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
      <analyzer>
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
      </analyzer>
    </fieldType>

I believe I used that link when I initially set up the field and it worked great (and I'm still using it in other places). In this particular example however it does not appear to be practical for me. I mentioned that I have a similarity class that returns 1 for the idf and in the case of an edgengram, it returns 1 * length of the search string.

Thanks,
Brian Lamb

On Tue, May 31, 2011 at 11:34 AM, bmdakshinamur...@gmail.com bmdakshinamur...@gmail.com wrote:

Can you specify the analyzer you are using for your queries? Maybe you could use a KeywordAnalyzer for your queries so you don't end up matching parts of your query.

http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

This should help you.

--
Thanks and Regards,
DakshinaMurthy BM
Re: Edgengram
Hi Brian,

I don't know if I understand what you are trying to achieve. You want the term query abcdefg to have an idf of 1 instead of 7? I think using the KeywordTokenizerFactory at query time should work. It would be something like:

    <fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
      <analyzer type="index">
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
    </fieldType>

This way, at query time abcdefg won't be turned into a ab abc abcd abcde abcdef abcdefg. At index time it will.

Regards,
Tomás

On Tue, May 31, 2011 at 1:07 PM, Brian Lamb brian.l...@journalexperts.com wrote:

    <fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
      <analyzer>
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
      </analyzer>
    </fieldType>

I believe I used that link when I initially set up the field and it worked great (and I'm still using it in other places). In this particular example however it does not appear to be practical for me. I mentioned that I have a similarity class that returns 1 for the idf and in the case of an edgengram, it returns 1 * length of the search string.

Thanks,
Brian Lamb
Re: Edgengram
...or also use the LowerCaseTokenizerFactory at query time for consistency, but not the edge ngram filter.

2011/5/31 Tomás Fernández Löbbe tomasflo...@gmail.com

Hi Brian,

I don't know if I understand what you are trying to achieve. You want the term query abcdefg to have an idf of 1 instead of 7? I think using the KeywordTokenizerFactory at query time should work. It would be something like:

    <fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
      <analyzer type="index">
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
    </fieldType>

This way, at query time abcdefg won't be turned into a ab abc abcd abcde abcdef abcdefg. At index time it will.

Regards,
Tomás
Re: Edgengram
For this, I ended up just changing it to string and using abcdefg* to match. That seems to work so far.

Thanks,
Brian Lamb

On Wed, May 25, 2011 at 4:53 PM, Brian Lamb brian.l...@journalexperts.com wrote:

Hi all,

I'm running into some confusion with the way edgengram works. I have the field set up as:

    <fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
      <analyzer>
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100" side="front"/>
      </analyzer>
    </fieldType>

I've also set up my own similarity class that returns 1 as the idf score. What I've found this does is if I match a string abcdefg against a field containing abcdefghijklmnop, then the idf will score that as a 7:

    7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2 abcdefg=2)

I get why that's happening, but is there a way to avoid that? Do I need to do a new field type to achieve the desired effect?

Thanks,
Brian Lamb
Edgengram
Hi all,

I'm running into some confusion with the way edgengram works. I have the field set up as:

  <fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
    <analyzer>
      <tokenizer class="solr.LowerCaseTokenizerFactory" />
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100" side="front" />
    </analyzer>
  </fieldType>

I've also set up my own similarity class that returns 1 as the idf score. What I've found is that if I match the string abcdefg against a field containing abcdefghijklmnop, the idf is scored as 7:

  7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2 abcdefg=2)

I get why that's happening, but is there a way to avoid it? Do I need to define a new field type to achieve the desired effect?

Thanks, Brian Lamb
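The 7.0 in the message above is what one would expect if the same edge n-gram analyzer also runs on the query, so that abcdefg turns into seven query terms with an idf of 1 each. Here is a minimal sketch of that arithmetic. It is not Solr code: `edge_ngrams` and `constant_idf` are hypothetical helpers, and it assumes the per-term idf values are simply added up.

```python
def edge_ngrams(token, min_gram=1, max_gram=100):
    """Front edge n-grams of a token (hypothetical helper, not Solr code)."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def constant_idf(term):
    # Stand-in for the custom similarity described above: idf is always 1.
    return 1.0


# Query analyzed with the edge n-gram filter: one query term per gram.
grams = edge_ngrams("abcdefg")
summed_idf = sum(constant_idf(g) for g in grams)

# Query kept as a single term, e.g. if the query side had no n-gram filter.
single_term_idf = sum(constant_idf(t) for t in ["abcdefg"])

print(grams)
print(summed_idf, single_term_idf)
```

If that assumption holds, declaring separate index and query analyzers, with the n-gram filter only on the index side, would be one way to keep the query as a single term.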
LetterTokenizer + EdgeNGram + apostrophe in query = invalid result
I have the following field defined in my schema:

  <fieldType name="ngram" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.LetterTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.LetterTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="person" type="ngram" indexed="true" stored="true" />

I have the default field set to person and have indexed the following document:

  <add>
    <doc>
      <field name="id"><![CDATA[1001116609]]></field>
      <field name="person"><![CDATA[Vincent M D'Onofrio]]></field>
    </doc>
  </add>

The following queries return the result as expected using the standard request handler: vincent m d onofrio d'o onofrio d onofrio

The following query fails: d'onofrio

This is weird because d'o returns a result. As soon as I type the n, I start to get no results. I ran this through the field analysis page and it shows that this query is being tokenized correctly and should yield a result.

I am using a build of trunk Solr (r1073990) and the example solrconfig.xml. I am also using the example schema with the addition of my ngram field.

Any ideas? I have tried this with other words containing an apostrophe and they all stop returning results after 4 characters.

Thanks, Matt Weber
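To see what the analysis chain above does with the apostrophe, here is a rough sketch in Python. It is an ASCII-only approximation of LetterTokenizer (split on anything that is not a letter), and all function names are made up for illustration.

```python
import re


def letter_tokenize(text):
    """Split on every non-letter character (ASCII-only approximation)."""
    return re.findall(r"[A-Za-z]+", text)


def edge_ngrams(token, min_gram=1, max_gram=25):
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


def index_terms(text):
    # Index analyzer: letter tokenizer, lowercase, edge n-grams.
    terms = set()
    for token in letter_tokenize(text):
        terms.update(edge_ngrams(token.lower()))
    return terms


def query_terms(text):
    # Query analyzer: letter tokenizer and lowercase only.
    return [t.lower() for t in letter_tokenize(text)]


indexed = index_terms("Vincent M D'Onofrio")
for q in ["d'o", "d'onofrio"]:
    terms = query_terms(q)
    print(q, terms, all(t in indexed for t in terms))
```

In this model both queries split into two terms, and every term is present in the index. So the failure presumably comes from how the two terms are combined, for example as a phrase query in which token positions matter. That is only a guess, nothing in this thread confirms it; the output of debugQuery=on would show the parsed query.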
Re: EdgeNgram Auto suggest - doubles ignore
Hi Erick, if you have time, can you please take a look and provide your comments or suggestions for this problem? Please let me know if you need any more information. Thanks, Johnny -- View this message in context: http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2451828.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: EdgeNgram Auto suggest - doubles ignore
I'm afraid I'll have to pass, I'm absolutely swamped at the moment. Perhaps someone else can pick it up.

I will say that you should be getting terms back when you pre-lower-case them, so look at your index via the admin page or Luke to check that what's really in the name field is what you think it is.

As for sorting, I haven't a clue. Start by backing out your custom sorting, verify that everything *except* sorting behaves as you expect, and then add it back in.

Best
Erick
Re: EdgeNgram Auto suggest - doubles ignore
Hi Erick,

I tried to use the terms component and ended up with the following problems.

Problem 1: Custom sort not working in the terms component: http://lucene.472066.n3.nabble.com/Term-component-sort-is-not-working-td1905059.html#a1909386

I want to sort using one of my custom fields [value_score]. I already gave it in my configuration, but it is not sorting properly. The following is the configuration in solrconfig.xml:

  <searchComponent name="termsComponent" class="org.apache.solr.handler.component.TermsComponent"/>
  <requestHandler name="/terms" class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
      <bool name="terms">true</bool>
      <str name="wt">json</str>
      <str name="fl">name</str>
      <str name="sort">value_score desc</str>
      <str name="indent">true</str>
    </lst>
    <arr name="components">
      <str>termsComponent</str>
    </arr>
  </requestHandler>

The Solr response is not returned in the order given by the sort parameter.

Problem 2: Case sensitivity [I am searching for Apple]:

  http://localhost/solr/core1/terms?terms.fl=name&terms.prefix=apple -- not working
  http://localhost/solr/core1/terms?terms.fl=name&terms.prefix=Apple -- working

I tried a regex to overcome the case-sensitivity problem:

  http://localhost/solr/core1/terms?terms.fl=name&terms.regex=Apple&terms.regex.flag=case_insensitive

Will this regex-based search help with my requirement? It is returning irrelevant results. I am using the same syntax as mentioned in the wiki: http://wiki.apache.org/solr/TermsComponent

Am I going wrong anywhere? Please let me know if you need any more info.

Thanks, Johnny
Re: EdgeNgram Auto suggest - doubles ignore
Hi Erick,

You are right, there is a copy field to EdgeNgram. I tried the configuration, but it is not working as expected.

Configuration I tried:

  <fieldType name="query" class="solr.TextField" positionIncrementGap="100" termVectors="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="user_query" type="query" indexed="true" stored="true" omitNorms="true" omitTermFreqAndPositions="true" />
  <field name="edgy_user_query" type="edgytext" indexed="true" stored="true" omitNorms="true" omitTermFreqAndPositions="true" />

  <defaultSearchField>edgy_user_query</defaultSearchField>

  <copyField source="user_query" dest="edgy_user_query"/>

When I search for the term apple, it returns results for pineapple vers apple, milk with apple, apple milk shake ...

Is there any other way to overcome this problem?

Thanks, Johnny -- View this message in context: http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2329370.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: EdgeNgram Auto suggest - doubles ignore
Let's back up here, because now I'm not clear what you actually want. EdgeNGrams are a way of matching substrings, which is what's happening here. Of course searching for apple matches all three examples, just as searching for apple without grams would; that's the expected behavior.

So we need a clear problem definition of what you're trying to do, along with example queries (please post the results of adding debugQuery=on).

Best
Erick
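To make the "expected behavior" concrete, here is a minimal sketch of the edgytext chain from this thread: whitespace tokenizer, lowercase and edge n-grams of size 3 to 25 at index time, keyword tokenizer plus lowercase at query time. The helper names are made up and this only models term matching, not scoring.

```python
def edge_ngrams(token, min_gram=3, max_gram=25):
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


def index_terms(text):
    # Index side: whitespace tokenizer, lowercase, edge n-grams (3..25).
    terms = set()
    for token in text.lower().split():
        terms.update(edge_ngrams(token))
    return terms


def query_term(text):
    # Query side: keyword tokenizer plus lowercase, so a single term.
    return text.lower()


docs = ["pineapple vers apple", "milk with apple", "apple milk shake"]
matches = [d for d in docs if query_term("apple") in index_terms(d)]
print(matches)
```

All three documents match because each contains the word apple as its own whitespace-separated token. The grams of pineapple (pin, pine, pinea, ...) are not what produces the hit on the first document.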
Re: EdgeNgram Auto suggest - doubles ignore
Hi Erick,

What I want here is this: let's say I have 3 documents like [pineapple vers apple, milk with apple, apple milk shake]. If I search for apple, it should return only apple milk shake, because that is the only one that starts with the letters I typed in. It should not bring back the others. If I type milk, it should return only milk with apple.

I want output similar to Google's auto suggest. Is there a way to achieve this without enclosing the query in double quotes?

Thanks, Johnny
Re: EdgeNgram Auto suggest - doubles ignore
I haven't figured out any way to achieve that AT ALL without making a separate Solr index just to serve autosuggest queries, at least when you want to auto-suggest on a multi-valued field. Someone posted a crazy tricky way to do it with a single-valued field a while ago.

If you can/are willing to make a separate Solr index with a schema set up specifically for auto-suggest, it's easy. But from an existing schema, where you want to auto-suggest just based on the values in one field, it's a multi-valued field, and you want to allow matches in the middle of the field, I don't think there's a way to do it.
Re: EdgeNgram Auto suggest - doubles ignore
Then you don't need NGrams at all. A wildcard will suffice, or you can use the TermsComponent.

If these strings are indexed as single tokens (KeywordTokenizer with LowerCaseFilter) you can simply do field:app* to retrieve the apple milk shake. You can also use the string field type, but then you must make sure the values are already lowercased before indexing.

Be careful though, there is no query time analysis for wildcard (and fuzzy) queries so make sure
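A small sketch of the wildcard suggestion above, assuming each value is indexed as one lowercased term. The function names are made up; `prefix_query` stands in for a query like field:app* and, like a real wildcard query, does no analysis of its own.

```python
def keyword_lowercase(text):
    # KeywordTokenizer + LowerCaseFilter: the whole value is one lowercased term.
    return text.lower()


def prefix_query(prefix, indexed_terms):
    # Wildcard queries are not analyzed, so the prefix is compared as given.
    return [t for t in indexed_terms if t.startswith(prefix)]


terms = [keyword_lowercase(d) for d in
         ["pineapple vers apple", "milk with apple", "Apple milk shake"]]

print(prefix_query("App".lower(), terms))  # caller lowercases the input first
print(prefix_query("App", terms))          # raw input, not lowercased
```

The second call returns nothing, which is the case-sensitivity trap mentioned above: the client has to lowercase what the user typed before building the wildcard query.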
Re: EdgeNgram Auto suggest - doubles ignore
Oh, I should perhaps mention that EdgeNGrams will yield results a lot quicker than using wildcards, at the cost of a larger index. You should, of course, use EdgeNGrams if you worry about performance and have a huge index and a high number of queries per second.
Re: EdgeNgram Auto suggest - doubles ignore
The index contains around 1.5 million documents. As this is used for an autosuggest feature, performance is an important factor.

So it looks like it is difficult to achieve the following using edgeNgram: the result should return only those terms where the search letters match the first word. For example, when we type M, it should return Mumford and Sons and not jackson Michael.

Jonathan, is it possible to achieve this with edgeNgram when we have a separate index?
Re: EdgeNgram Auto suggest - doubles ignore
Ah, sorry, I got confused about your requirements. If you just want to match at the beginning of the field, it may be more possible, using edgegrams or a wildcard, if you have a single-valued field. Do you have a single-valued or a multi-valued field? That is, does each document have just one value, or multiple? I still get confused about how to do it with edgegrams, even with a single-valued field, but I think maybe it's possible.

_Definitely_ possible, with or without edgegrams, if you are willing/able to make a completely separate Solr index where each term for auto-suggest is a document.

Yes. The problem lies in what results are. In general, Solr's results are the documents you have in the Solr index. Thus everything is a lot easier to deal with if you have an index where each document is a term for auto-suggest. But that doesn't always meet requirements if you need to auto-suggest within existing fq's and such, and of course it takes more resources to run an additional Solr index.
Re: EdgeNgram Auto suggest - doubles ignore
Right now our configuration says multivalues=true, but that need not be true in our case. I will make it false, try again, and update this thread with more details.
Re: EdgeNgram Auto suggest - doubles ignore
OK, try this. Use some analysis chain for your field like:

  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>

This can be a multiValued field, BTW. Now use the TermsComponent to fetch your data. See http://wiki.apache.org/solr/TermsComponent and specify terms.prefix=apple, e.g.

  http://localhost:8983/solr/terms?terms.prefix=app&terms.fl=blivet

The return list should be what you want. Note that the returned values will be lower cased, and you can only specify lower case in your search term (all because of specifying the lowercase filter in my example).

This should be very fast no matter what your index size, as the return list size defaults to 10 (though you can specify different numbers).

Best
Erick
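Since the search term has to be lower case, a client would normally lowercase the user's input before calling the handler. Here is a minimal sketch of building such a request URL. The host, handler path and field name blivet are taken from the example above; `terms_url` itself is a hypothetical helper, and the explicit terms.limit just spells out the default of 10.

```python
from urllib.parse import urlencode


def terms_url(base, field, typed, limit=10):
    # terms.prefix is compared against the indexed (lowercased) terms without
    # any analysis, so the user's input is lowercased here on the client side.
    params = {
        "terms": "true",
        "terms.fl": field,
        "terms.prefix": typed.lower(),
        "terms.limit": limit,
    }
    return base + "?" + urlencode(params)


print(terms_url("http://localhost:8983/solr/terms", "blivet", "App"))
```

Using `urlencode` also takes care of escaping spaces and other special characters if the user types a multi-word prefix.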
Re: Query performance issue while using EdgeNGram
1) Thanks for this update. I have to use 'WhiteSpaceTokenizer'.
2) I have to suggest the whole query itself (say name or title).
3) Could you please let me know if there is a way to find the evicted docs?
4) Yes, we are seeing improvement in the response time if we optimize. But for some queries QTime is still more than 8 secs. It is a 'Blocker' for us. Could you please suggest any way to reduce the QTime to 1 sec?
Re: Query performance issue while using EdgeNGram
Hmmm. Find evicted docs? If you mean find out how many docs are deleted, look at the admin schema browser page; the difference between MaxDoc and NumDocs is the number of deleted documents.

You say for some queries the QTime is more than 8 secs. What happens if you re-run that query a bit later? The reason I ask is that if you're not warming the cache that particular query uses, you may be seeing cache loading time here. Look at the admin stats page, especially for evictions. It's also possible that your caches are being reclaimed for some queries and you're seeing response time spikes when the caches are re-loaded.

Best
Erick
Re: Query performance issue while using EdgeNGram
  })(.*)?" replacement="$1" replace="all" />
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="name" type="string" indexed="false" stored="true"/>
    <field name="id" type="string" indexed="true" stored="true" />
    <field name="score" type="sfloat" indexed="true" stored="false" />
    <field name="autosuggest" type="typeahead" indexed="true" stored="false"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <defaultSearchField>autosuggest</defaultSearchField>
  <copyField source="name" dest="autosuggest"/>
  <solrQueryParser defaultOperator="AND"/>
Re: EdgeNGram relevancy
Thanks for the explanation. The results for the autocompletion are pretty good now, but we still have a small problem. When there are hits in the edgytext2 field, results which only have hits in the edgytext field should not be returned at all.

Example:

Query: Martin Sco

Current results (in that order):
- Martin Scorsese
- Martin Lawrence
- Joseph Martin

However, in an autocompletion context, only Martin Scorsese makes sense; the other two are logically not correct. I'm not sure if this can be solved on the Solr side, or if we should implement the logic in the application.

thanks! -robert

On Nov 12, 2010, at 12:13 AM, Jonathan Rochkind wrote:

Without the parens, the edgytext: only applied to Mr; the default field still applied to Scorsese.

The double quotes are necessary in the second case (rather than parens) because, on a non-tokenized field, the standard query parser will otherwise pre-tokenize on whitespace before sending individual whitespace-separated words to match the index. If the index includes multi-word tokens with internal whitespace, they will never match. With double quotes, the standard query parser doesn't pre-tokenize like this; it passes the whole phrase to the index intact.

Robert Gründler wrote:

Did you run your query without using () and "" operators? If yes can you try this? q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0

I didn't use () and "" in my query before. Using the query with those operators works now; stopwords are thrown out as they should be, thanks. However, I don't understand how the () and "" operators affect the StopWordFilter. Could you give a brief explanation for the above example?

thanks! -robert
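Jonathan's point that a field prefix only binds to the term it is attached to can be shown with a toy model. This is not the Lucene query parser: `bind_fields` is a made-up function that handles bare whitespace-separated terms only (no parentheses, quotes or boosts), and default_field is a placeholder for whatever default field the schema declares.

```python
def bind_fields(query, default_field="default_field"):
    """Toy model: a 'field:' prefix applies only to the term it is attached to."""
    clauses = []
    for token in query.split():
        if token.upper() in ("OR", "AND"):
            continue  # boolean operators are skipped in this toy model
        if ":" in token:
            field, term = token.split(":", 1)
        else:
            field, term = default_field, token
        clauses.append((field, term))
    return clauses


print(bind_fields("edgytext:Mr Scorsese OR edgytext2:Mr Scorsese"))
```

In this model Scorsese ends up on the default field both times, which matches the explanation above. Grouping with parentheses or double quotes is what keeps both words on the named field.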
Re: EdgeNGram relevancy
It seems adding the '+' (required) operator to each term in a multi-term query does the trick: http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#+

i.e.: edgytext2:(+Martin +Sco)

-robert
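The effect of making every term required can be sketched as follows. This models a whitespace-tokenized edge n-gram field like edgytext, ignoring stop words, the pattern replace filter and scoring; the function names are made up for illustration.

```python
def edge_ngrams(token, min_gram=1, max_gram=25):
    return {token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)}


def index_terms(name):
    # Whitespace tokenizer, lowercase, edge n-grams per token.
    terms = set()
    for token in name.lower().split():
        terms |= edge_ngrams(token)
    return terms


def search(query, names, require_all):
    wanted = query.lower().split()
    combine = all if require_all else any  # '+' on every term vs. plain OR
    return [n for n in names if combine(t in index_terms(n) for t in wanted)]


names = ["Martin Scorsese", "Martin Lawrence", "Joseph Martin"]
print(search("Martin Sco", names, require_all=False))
print(search("Martin Sco", names, require_all=True))
```

With optional terms all three names match because each contains martin; once both terms are required, only Martin Scorsese is left.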
EdgeNGram relevancy
Hi,

consider the following fieldtype (used for autocompletion):

  <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
    </analyzer>
  </fieldType>

This works fine as long as the query string is a single word. For multiple words, the ranking is weird though.

Example:

Query String: Bill Cl

Result (in that order):
- Clyde Phillips
- Clay Rogers
- Roger Cloud
- Bill Clinton

Bill Clinton should have the highest rank in that case. Does anyone have an idea how to configure this fieldtype so that matches in both tokens rank higher than matches in only one of them?

thanks! -robert
Re: EdgeNGram relevancy
You can add an additional field, using KeywordTokenizerFactory instead of WhitespaceTokenizerFactory, and query both fields with an OR operator:

  edgytext:(Bill Cl) OR edgytext2:"Bill Cl"

You can even apply a boost so that begins-with matches come first.
Re: EdgeNGram relevancy
Thanks a lot, that setup works pretty well now. The only problem is that the StopWords do not work that well anymore. I'll provide an example, but first the 2 fieldtypes:

  <!-- autocomplete field which finds matches inside strings (scor matches Martin Scorsese) -->
  <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
    </analyzer>
  </fieldType>

  <!-- autocomplete field which finds startsWith matches only (scor matches only Scorpio, but not Martin Scorsese) -->
  <fieldType name="edgytext2" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
    </analyzer>
  </fieldType>

This setup now causes trouble with StopWords. Here's an example:

Let's say the index contains 2 strings: Mr Martin Scorsese and Martin Scorsese. Mr is in the stopword list.

Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0

This way, the only result I get is Mr Martin Scorsese, because the strict field edgytext2 is boosted by 2.0.

Any idea why in this case Martin Scorsese is not in the result at all?

thanks again! -robert
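To see why the keyword-tokenized edgytext2 type only produces starts-with matches, here is a minimal sketch of its chain: the whole value is one token, lowercased, stripped of everything outside a-z, then cut into edge n-grams of size 1 to 25. The helper names are made up, and the query side assumes the phrase reaches the analyzer in one piece, for example because it was double-quoted.

```python
import re


def edgytext2_index_terms(value, min_gram=1, max_gram=25):
    # KeywordTokenizer + lowercase + strip non a-z + edge n-grams.
    token = re.sub(r"[^a-z]", "", value.lower())
    return {token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)}


def edgytext2_query_term(phrase):
    # Same chain without the n-gram filter.
    return re.sub(r"[^a-z]", "", phrase.lower())


names = ["Clyde Phillips", "Clay Rogers", "Roger Cloud", "Bill Clinton"]
q = edgytext2_query_term("Bill Cl")
print(q, [n for n in names if q in edgytext2_index_terms(n)])
```

Because the spaces are removed, Bill Cl becomes billcl, which is a prefix of billclinton and of nothing else in the list. By the same logic scor is among the grams of Scorpio but not among the grams of Martin Scorsese in this field.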
Re: EdgeNGram relevancy
Did you run your query without using () and "" operators? If yes, can you try this?

  q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0

If no, can you paste the output of debugQuery=on?
Re: EdgeNGram relevancy
This would still not deal with the problem of removing stop words from the indexing and query analysis stages. I really need something that will allow that and give a single token as in the example below.

Best
Nick
Re: EdgeNGram relevancy
Could anyone help me understand why "Clyde Phillips" appears in the results for "Bill Cl"?

"Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so why is it even in the results?

Thanks.

--- On Thu, 11/11/10, Ahmet Arslan iori...@yahoo.com wrote:

You can add an additional field that uses KeywordTokenizerFactory instead of WhitespaceTokenizerFactory, and query both fields with an OR operator:

edgytext:(Bill Cl) OR edgytext2:"Bill Cl"

You can even apply a boost so that begins-with matches come first.
Re: EdgeNGram relevancy
According to the fieldtype I posted previously, I think it's because of:

1. WhitespaceTokenizer splits the string "Clyde Phillips" into 2 tokens: "Clyde" and "Phillips"
2. EdgeNGramFilter gets the 2 tokens and creates edge n-grams for each token: "C" "Cl" "Cly" ... and "P" "Ph" "Phi" ...

The query string "Bill Cl" gets split up into 2 tokens, "Bill" and "Cl", by the WhitespaceTokenizer. This creates a match between the 2nd token of the query, "Cl", and one of the subtokens the EdgeNGramFilter created: "Cl".

-robert

On Nov 11, 2010, at 21:34, Andy wrote:

Could anyone help me understand why "Clyde Phillips" appears in the results for "Bill Cl"?

"Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so why is it even in the results?

Thanks.
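The explanation above can be traced step by step with a small Python model of the index-time and query-time chains. It is only an approximation: the stop filter and pattern-replace filter are omitted, the helper names are invented, and the grams come out lowercase because the posted fieldtype runs LowerCaseFilterFactory before the n-gram filter.

```python
def edge_ngrams(token, min_gram=1, max_gram=25):
    """Leading prefixes of a token, shortest first."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

def analyze_index(text):
    """Whitespace split -> lowercase -> edge n-grams, kept per source token."""
    return {tok: edge_ngrams(tok.lower()) for tok in text.split()}

def analyze_query(text):
    """Whitespace split -> lowercase; the query analyzer has no n-gram filter."""
    return [tok.lower() for tok in text.split()]

indexed = analyze_index("Clyde Phillips")
for tok, grams in indexed.items():
    print(tok, "->", grams)

# Pair every query token with the indexed word whose grams contain it.
hits = [(q, tok)
        for q in analyze_query("Bill Cl")
        for tok, grams in indexed.items()
        if q in grams]
print(hits)
```

The only pairing found is the query token "cl" against the grams of "Clyde". "bill" matches nothing, but with OR semantics a single matching clause is enough for the document to be returned.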
Re: EdgeNGram relevancy
Ah I see. Thanks for the explanation.

Could you set the defaultOperator to AND? That way both "Bill" and "Cl" must match, and that would exclude "Clyde Phillips".

--- On Thu, 11/11/10, Robert Gründler rob...@dubture.com wrote:

From: Robert Gründler rob...@dubture.com
Subject: Re: EdgeNGram relevancy
To: solr-user@lucene.apache.org
Date: Thursday, November 11, 2010, 3:51 PM

According to the fieldtype I posted previously, I think it's because of:

1. WhitespaceTokenizer splits the string "Clyde Phillips" into 2 tokens: "Clyde" and "Phillips"
2. EdgeNGramFilter gets the 2 tokens and creates edge n-grams for each token: "C" "Cl" "Cly" ... and "P" "Ph" "Phi" ...

The query string "Bill Cl" gets split up into 2 tokens, "Bill" and "Cl", by the WhitespaceTokenizer. This creates a match between the 2nd token of the query, "Cl", and one of the subtokens the EdgeNGramFilter created: "Cl".

-robert
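As a rough illustration of the AND suggestion, the following Python sketch compares OR and AND semantics on the names from the thread. It is a toy matcher, not Solr: scoring is ignored, the `search` helper and the extra name "Clara Billings" are hypothetical, and in a real setup the operator would be set through the query parser configuration rather than a function argument.

```python
def edge_ngrams(token, min_gram=1, max_gram=25):
    """Leading prefixes of a token, shortest first."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

def index_terms(name):
    """Lowercased edge n-grams of every whitespace-separated word."""
    return {g for word in name.lower().split() for g in edge_ngrams(word)}

def search(query, names, operator="OR"):
    """Return names where any (OR) or all (AND) query tokens hit an indexed term."""
    combine = all if operator == "AND" else any
    q_terms = query.lower().split()
    return [n for n in names if combine(t in index_terms(n) for t in q_terms)]

names = ["Clyde Phillips", "Clay Rogers", "Roger Cloud", "Bill Clinton"]

or_hits = search("Bill Cl", names, "OR")
and_hits = search("Bill Cl", names, "AND")
print("OR: ", or_hits)
print("AND:", and_hits)

# AND does not enforce word order: a made-up name with the words swapped still matches.
print(search("Bill Cl", ["Clara Billings"], "AND"))
```

Under AND only "Bill Clinton" survives from the original list. The last line shows a limitation of this approach: a name such as the hypothetical "Clara Billings" would also match, because AND requires both prefixes to be present but says nothing about their order.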
Re: EdgeNGram relevancy
Did you run your query without using the () and "" operators? If yes, can you try this?

q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0

I didn't use () and "" in my query before. Using the query with those operators works now; stopwords are thrown out as they should be, thanks.

However, I don't understand how the () and "" operators affect the StopWordFilter. Could you give a brief explanation for the above example?

thanks!

-robert
Re: EdgeNGram relevancy
Without the parens, the "edgytext:" prefix only applied to "Mr"; the default field still applied to "Scorsese".

The double quotes are necessary in the second case (rather than parens) because, on a non-tokenized field, the standard query parser will pre-tokenize on whitespace before sending the individual whitespace-separated words to be matched against the index. If the index includes multi-word tokens with internal whitespace, those individual words will never match. Inside double quotes, however, the standard query parser doesn't pre-tokenize like this; it passes the whole phrase on intact to be matched against the index.

Robert Gründler wrote:

Did you run your query without using the () and "" operators? If yes, can you try this?

q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0

I didn't use () and "" in my query before. Using the query with those operators works now; stopwords are thrown out as they should be, thanks.

However, I don't understand how the () and "" operators affect the StopWordFilter. Could you give a brief explanation for the above example?

thanks!

-robert
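The field-scoping behaviour described above can be sketched with a toy clause splitter in Python. This is not the Lucene query parser; it only models three cases (bare term, parenthesized group, quoted phrase), ignores boosts, and the default field name "text" is an assumption chosen for illustration.

```python
import re

# Optional "field:" prefix, then a (group), a "phrase", or a bare term; a trailing ^boost is skipped.
CLAUSE = re.compile(r'(?:(\w+):)?(\([^)]*\)|"[^"]*"|[^\s()"^]+)(?:\^[\d.]+)?')

def scope_clauses(query, default_field="text"):
    """Return (field, kind, text) for each clause, showing which field it would target."""
    clauses = []
    for field, body in CLAUSE.findall(query):
        if body in ("OR", "AND"):
            continue  # boolean operators carry no field of their own
        field = field or default_field
        if body.startswith("("):
            # A field prefix on a group applies to every word inside the parens.
            clauses += [(field, "term", word) for word in body[1:-1].split()]
        elif body.startswith('"'):
            # A quoted phrase stays in one piece, whitespace included.
            clauses.append((field, "phrase", body[1:-1]))
        else:
            clauses.append((field, "term", body))
    return clauses

without_ops = scope_clauses('edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0')
with_ops = scope_clauses('edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0')

for clause in without_ops:
    print(clause)
print("---")
for clause in with_ops:
    print(clause)
```

In the first query both occurrences of "Scorsese" fall to the default field, so neither edgytext nor edgytext2 is ever asked about it. If "Mr" is then dropped as a stop word on edgytext, the only clause left that can match is edgytext2:Mr, which would be consistent with "Mr Martin Scorsese" being the sole result. In the second query every word is scoped to the intended field and the phrase reaches edgytext2 as a single unit.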