Re: Wildcards and fuzzy/phonetic query
Haagen wrote:
Lowercasing actually seems to work with Wildcard queries, but not with fuzzy queries. Are there any reasons why I should experience such a difference?

Hi Haagen,

Yonik added this recently: https://issues.apache.org/jira/browse/SOLR-4076
Re: Wildcards and fuzzy/phonetic query
Thank you! I did try to look through Jira, but I didn't focus on the minor issues. For me, this one is quite critical. :-) Any chance of merging it into the 4.0.1 release?

Regards,
Haagen

On 11 Dec 2012, at 12:45, Ahmet Arslan wrote:
Hi Haagen, Yonik added this recently. https://issues.apache.org/jira/browse/SOLR-4076
Re: Wildcards and fuzzy/phonetic query
It's been two months since I asked about wildcards and phonetic filters, and the task of upgrading Solr to version 4.0 has finally been prioritized in our project, so I've spent the last couple of days working on it. Another team member upgraded Solr from 3.4 to 4.0, and I've been changing schema.xml to accommodate the new multiterm functionality.

However, it doesn't seem to work. Lowercasing is still not applied when I do a fuzzy search: not through the regular index analyzer and its support of MultitermAwareComponents, and not when I try to define a special multiterm analyzer.

Do I have to do anything special to enable the multiterm functionality in Solr 4.0?

Regards,
Hågen

On 8 Oct 2012, at 18:09, Erick Erickson wrote:
whether phonetic filters can be multiterm aware: I'd be leery of this, as I basically don't quite know how that would behave. (…)
Re: Wildcards and fuzzy/phonetic query
Lowercasing actually seems to work with wildcard queries, but not with fuzzy queries. Is there any reason why I should see such a difference?

Regards,
Haagen

On 10 Dec 2012, at 13:24, Haagen Hasle wrote:
(…) Lowercasing is still not done when I do a fuzzy search, not through the regular index analyzer and its support of MultitermAwareComponents, and not when I try to define a special multiterm analyzer. Do I have to do anything special to enable the multiterm functionality in Solr 4.0?
Re: Wildcards and fuzzy/phonetic query
I used the admin/analysis page (great tip, I had never used it before, thank you!) and it seems to me that the DoubleMetaphone filter converts Hågen to both JN and KN. Will that crash the Solr analysis if I try to include this filter in the multiterm analysis?

Do you know where I can find out more about combining wildcard and fuzzy in the same query? When you say you don't think it is possible, do you mean that it is not implemented in Solr today, or that it can't be implemented because it is technically impossible or functionally doesn't make sense? :)

I wrote in an answer to Otis that I'd like to try to combine fuzzy with Ngram as well. Do you know if that is possible and makes any sense?

Thanks to everyone for quick and good answers, I really appreciate it!

Regards,
Hågen

On 8 Oct 2012, at 21:35, Erick Erickson wrote:
To answer your first question, yes, you've got it right. If you define a multiterm section in your fieldType, whatever you put in that section gets applied whether the underlying class is MultiTermAware or not. Which means you can shoot yourself in the foot really bad <G>... (…) Fuzzy searches + wildcards. I don't think you can do that reasonably, but I'm not entirely sure.

Best
Erick
Re: Wildcards and fuzzy/phonetic query
Hi,

Also be sure to check out the new BeiderMorse phonetic filter: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.BeiderMorseFilterFactory which handles Middle Eastern and Eastern European names very well.

Phonetic algorithms use tons of rules for how to transform an input depending on what comes before and after, so I don't believe you'll get wildcards to work perfectly combined with phonetics, since Solr cannot guess what should come next. But you may get it to work for many cases; the best approach is simply to try it out. Use EdgeNgram followed by some phonetic filter and try.

You may also be interested in a MeetUp talk held in Oslo last month: http://www.meetup.com/Oslo-Solr-Community/events/67253692/ You'll find the link to Mats' talk about Norwegian phonetics if you scroll down that page.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 9 Oct 2012, at 11:54, Haagen Hasle haagenha...@gmail.com wrote:
I used the admin/analysis page (great tip, I had never used it before - thank you!) and it seems to me that the DoubleMetaphone filter converts Hågen to both JN and KN. Will that crash the Solr analysis if I try to include this filter in the multiterm-analysis? (…)
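Jan's EdgeNgram suggestion can be illustrated with a few lines of Python. This is only a sketch of what an edge n-gram filter emits for one token; the function name and the `min_gram`/`max_gram` parameters are hypothetical names chosen for the example and are not taken from Solr.

```python
def edge_ngrams(term, min_gram=1, max_gram=5):
    """Return the leading substrings of `term` with lengths min_gram..max_gram.

    Hypothetical helper for illustration; it only mimics the idea of an
    edge n-gram filter and is not Solr code.
    """
    upper = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, upper + 1)]


if __name__ == "__main__":
    # Each indexed name is expanded into its prefixes.
    for name in ("christian", "kristian"):
        print(name, edge_ngrams(name, 2, 4))
```

If prefixes like these are indexed as ordinary terms, a user typing `chr` can be served by a plain term query instead of a wildcard query, so the query text goes through the normal analysis chain. Whether a phonetic filter placed after the n-grams gives useful matches is exactly what Jan suggests testing.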
Re: Wildcards and fuzzy/phonetic query
It won't crash Solr if you include it, but it probably won't do what you expect either, due to how wildcards are expanded. And it gets worse. DoubleMetaphone tries to reduce what it analyzes, well, phonetically, with close letters (or multiple choices). Some phonetic filters change to fixed 4 letter combinations, as I remember. Some hash to a completely different string. Some...

About combining fuzzy and wildcard: I haven't thought it through, but it strikes me as fraught with unexpected results. Consider har* and treating it as a fuzzy match. How would you calculate the fuzziness of hardiness and harp? Would you consider her a fuzzy match? How about farther? Or even father? You might be able to do something interesting with EdgeNgram here, but it still seems like it's going to either explode computationally or produce results that don't really mean much. But I'm mostly speculating here.

Frankly, though, I'd do what Jan suggests. Try it out and see if it's good enough. Especially pin down the use cases. Often requirements like this are specified by someone who, when presented with the results of what you can do easily, decides the effort could best be spent somewhere else. This whole approach will only increase the number of documents that are found as the result of a search, without necessarily increasing the relevance of the top N docs on the first page. Users rarely go to the second page, and often don't even look past the first few results. Doing wildcard AND fuzzy queries would likely result in something useful a very small percentage of the time. But that's just a guess.

Best
Erick

On Tue, Oct 9, 2012 at 5:54 AM, Haagen Hasle haagenha...@gmail.com wrote:
I used the admin/analysis page (great tip, I had never used it before - thank you!) and it seems to me that the DoubleMetaphone filter converts Hågen to both JN and KN. Will that crash the Solr analysis if I try to include this filter in the multiterm-analysis? (…)
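Erick's har* question can be made concrete by computing plain edit distances from the prefix to his example words. The sketch below uses a textbook Levenshtein implementation in Python; it is an illustration of the difficulty only, and it makes no claim about how Solr's FuzzyQuery computes or limits distances.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Words taken from Erick's message; "har" is the text before the wildcard.
    for word in ("harp", "her", "hardiness", "father", "farther"):
        print(word, levenshtein("har", word))
```

Measured against the bare prefix, harp and her are equally close, while hardiness, which a wildcard would match, is much further away than her, which a wildcard would not match. That mismatch is one way to see why treating a wildcard term as fuzzy has no obvious definition.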
Wildcards and fuzzy/phonetic query
Hi!

I'm quite new to Solr. I was recently asked to help out on a project where the previous Solr person quit quite suddenly. I've noticed that some of our searches don't return the expected result, and I'm hoping you guys can help me out.

We've indexed a lot of names, and would like to search for a person in our system using these names. We previously used Oracle Text for this, and we find that Solr is much faster. So far so good! :) But when we try to use wildcards, things start to go wrong.

We're using Solr 3.4, and I see that some of our problems are solved in 3.6. Ref SOLR-2438: https://issues.apache.org/jira/browse/SOLR-2438

But we would also like to be able to combine wildcards with fuzzy searches, and wildcards with a phonetic filter. I don't see anything about phonetic filters in SOLR-2438 or SOLR-2921 (https://issues.apache.org/jira/browse/SOLR-2921). Is it possible to make the phonetic filters MultiTermAware?

Regarding fuzzy queries, in Oracle Text I can search for chr% (chr* in Solr) and find both christian and kristian. As far as I understand, this is not possible in Solr: WildcardQuery and FuzzyQuery cannot be combined. Is this correct, or have I misunderstood anything? Are there any workarounds or filter combinations I can use to achieve the same result? I've seen people suggest using a boolean query to combine the two, but I don't really see how that would solve my chr* problem.

As I mentioned earlier, I'm quite new to this, so I apologize if what I'm asking about only shows my ignorance.

Regards,
Hågen
Re: Wildcards and fuzzy/phonetic query
A regular expression term may provide what you want, but not exactly. Maybe something like:

/(ch|k)r.*/

(No guarantee that will actually work.) See: http://lucene.apache.org/core/4_0_0-BETA/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Regexp_Searches

It will probably also be slower than desirable.

-- Jack Krupansky

-----Original Message-----
From: Hågen Pihlstrøm Hasle
Sent: Monday, October 08, 2012 11:21 AM
To: solr-user@lucene.apache.org
Subject: Wildcards and fuzzy/phonetic query

(…) Regarding fuzzy queries, in Oracle Text I can search for chr% (chr* in Solr..) and find both christian and kristian. (…)
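To see what Jack's pattern would and would not accept, it can be tried against a few names with Python's `re` module. This is a sketch under two assumptions: that the field is lowercased at index time, and that the pattern has to match the whole term (hence `fullmatch`). Lucene's regular expression syntax is not identical to Python's, so this only checks the logic of the pattern, not its behaviour in Solr.

```python
import re

# Jack's pattern, without the surrounding slashes used by the query parser.
PATTERN = re.compile(r"(ch|k)r.*")


def matches(name):
    """True if the lowercased name matches the pattern as a whole term."""
    return PATTERN.fullmatch(name.lower()) is not None


if __name__ == "__main__":
    names = ["Christian", "Kristian", "Chris", "Karl", "Henrik"]
    print([name for name in names if matches(name)])
```

The pattern covers the ch/k alternation, but every further substitution has to be written into the expression by hand.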
Re: Wildcards and fuzzy/phonetic query
On whether phonetic filters can be multiterm aware: I'd be leery of this, as I basically don't quite know how that would behave. You'd have to ensure that the algorithms changed the first parts of the words uniformly, regardless of what followed. I'm pretty sure that _some_ phonetic algorithms do not follow this pattern, e.g. eric wouldn't necessarily have the same beginning as erickson. That said, some of the algorithms _may_ follow this rule and might be OK candidates for being MultiTermAware.

But you don't need this in order to try it out. See the Expert Level Schema Possibilities at: http://searchhub.org/dev/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/

You can define your own analysis chain for wildcards as part of your fieldType definition and include whatever you want, whether or not it's MultiTermAware, and it will be applied at query time. Use the analyzer type=query entry as a basis. _But_ you shouldn't include anything in this section that produces more than one output per input token. Note: token, not field. For example, a really bad candidate for this section is WordDelimiterFilterFactory. If you use the admin/analysis page (which you'll get to know intimately), look at a type that has WordDelimiterFilterFactory in its chain, and put in something like erickErickson1234, you'll see what I mean. Make sure to check the verbose box.

If you can determine that some of the phonetic algorithms _should_ be MultiTermAware, please feel free to raise a JIRA and we can discuss. I suspect it'll be on a case-by-case basis.

Best
Erick

On Mon, Oct 8, 2012 at 11:21 AM, Hågen Pihlstrøm Hasle haagenha...@gmail.com wrote:
(…) Is it possible to make the phonetic filters MultiTermAware? (…)
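Erick's eric/erickson concern can be checked with a small experiment. The sketch below uses American Soundex, swapped in only because it fits in a few lines; it is not DoubleMetaphone, which is what the thread actually uses, and it is a hand-written illustration rather than the code behind any Solr filter.

```python
# Letter groups of American Soundex; vowels and h, w, y carry no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code of `word` (zero padded)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    encoded = word[0].upper()
    previous = CODES.get(word[0], "")
    for ch in word[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            encoded += digit
        if ch not in "hw":  # h and w do not separate letters with equal codes
            previous = digit
    return (encoded + "000")[:4]


if __name__ == "__main__":
    prefix_code = soundex("eric")
    full_code = soundex("erickson")
    print(prefix_code, full_code, full_code.startswith(prefix_code))
```

With this fixed-length encoding, the code of the typed prefix is not a prefix of the code of the full name, so a wildcard built from the encoded prefix would miss it. Other algorithms may behave differently, which is why Erick expects a case-by-case answer.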
Re: Wildcards and fuzzy/phonetic query
Hi,

Consider looking into synonyms and ngrams.

Otis
--
Performance Monitoring - http://sematext.com/spm

On Oct 8, 2012 11:21 AM, Hågen Pihlstrøm Hasle haagenha...@gmail.com wrote:
(…) Are there any workarounds or filter-combinations I can use to achieve the same result? (…)
Re: Wildcards and fuzzy/phonetic query
I guess synonyms would give me a result similar to using regexes, as Jack wrote about. I've thought about that, but I don't think it would be good enough. Substituting k for ch is easy enough, but the problem is that I have to think of every possible substitution in advance. I'd like Fil* to find Phillip, I'd like Hen* to find Hansen, and so on. The possibilities are quite endless, and I can't think of them all. I can't limit myself to Norwegian names either; a lot of people living in Norway have names from other countries. I'd like Moha* to find Mouhammed, etc. Or am I too pessimistic?

I haven't read enough about Ngrams yet, so I'm not sure if I've understood them properly. It divides the word into several pieces and tries to find one or more matches? Would that really help in my Chr* example? I guess you mean the combination of synonyms and ngrams?

Is it possible to combine ngrams with a fuzzy query, so that every piece of a word is matched in a fuzzy way? Could that help me?

I'll certainly look into ngrams more, thanks for the suggestion.

Regards,
Hågen

On Oct 8, 2012, at 7:23 PM, Otis Gospodnetic wrote:
Hi, Consider looking into synonyms and ngrams.
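Hågen's question about what n-grams would and would not buy him can be explored outside Solr. The following Python sketch splits names into character bigrams and measures how many they share (Jaccard similarity). The function names and the choice of bigrams are assumptions made for this example; Solr's NGram filters index the grams as terms and score matches differently.

```python
def ngrams(term, n=2):
    """Set of character n-grams of the lowercased term."""
    term = term.lower()
    return {term[i:i + n] for i in range(len(term) - n + 1)}


def overlap(a, b, n=2):
    """Jaccard similarity of the n-gram sets of two names (0.0 to 1.0)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    pairs = [("Mohammed", "Mouhammed"), ("christian", "kristian"),
             ("chr", "christian"), ("chr", "kristian")]
    for a, b in pairs:
        print(a, b, round(overlap(a, b), 2))
```

Full names with small spelling differences share most of their bigrams, but the short prefix chr shares none with kristian, which supports the doubt that n-grams alone would solve the Chr* case.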
Re: Wildcards and fuzzy/phonetic query
I understand that I'm quickly reaching the boundaries of my Solr competence when I'm supposed to read about Expert Level concepts. :) I had already read it once, but now I read it again. Twice. And I'm not sure if I understand it correctly. So let me ask a follow-up question: if I define an analyzer of type multiterm, will every filter I include for that analyzer be applied, even if it's not MultiTermAware?

To complicate this further, I'm not really sure if phonetic filters are a good match for our needs. We search for names, and these names can come from all over the world. We use DoubleMetaphone, and Wikipedia says it tries to account for myriad irregularities in English of Slavic, Germanic, Celtic, Greek, French, Italian, Spanish, Chinese, and other origin. So I guess it's quite good. But how about names from the Middle East, Pakistan or India? Is DoubleMetaphone a good match for names from these countries too? Are there any better algorithms?

How about fuzzy searches and wildcards, are they impossible to combine? We actually do three queries for every search: one fuzzy, one phonetic and one using ngram. Because I don't have too much confidence in the phonetic algorithm, I would really like to be able to combine fuzzy queries with wildcards. :)

Regards,
Hågen

On Oct 8, 2012, at 6:09 PM, Erick Erickson wrote:
(…) You can define your own analysis chain for wildcards as part of your fieldType definition and include whatever you want, whether or not it's MultiTermAware and it will be applied at query time. (…)
Re: Wildcards and fuzzy/phonetic query
To answer your first question, yes, you've got it right. If you define a multiterm section in your fieldType, whatever you put in that section gets applied whether the underlying class is MultiTermAware or not. Which means you can shoot yourself in the foot really bad G... Well, you have 6 or so possibilities out of the box...and all of them will fail at times. Fuzzy searches will also fail at times. And so will most anything else you try. The problem is these are algorithmic in nature and there are just too many cases that don't fit, human language is so endlessly variable Whether Middle Eastern names will work well with phonetic filters, well, what's the input language? Are you indexing English (or Norwegian or...) translations? In that case things should work OK since the phonetic variations should be accounted for in the translations. If you're indexing in different languages, you can apply different phonetic filters on different fields, so you might be able to work it that way. But if you're indexing multiple languages in to a _single_ field, you'll have a lot of other problems to solve before you start worrying about phonetics... All I can really say is give it a try and see how well it works since good search results are so domain dependent Fuzzy searches + wildcards. I don't think you can do that reasonably, but I'm not entirely sure. Best Erick On Mon, Oct 8, 2012 at 2:28 PM, Hågen Pihlstrøm Hasle haagenha...@gmail.com wrote: I understand that I'm quickly reaching the boundaries of my Solr-competence when I'm supposed to read about Expert Level concepts.. :) I had already read it once, but now I read it again. Twice. And I'm not sure if I understand it correctly.. So let me ask a follow-up question: If I define an analyzer of type multiterm, will every filter I include for that analyzer be applied, even if it's not MultiTermAware? To complicate this further, I'm not really sure if phonetic filters is a good match for our needs. 
We search for names, and these names can come from all over the world. We use DoubleMetaphone, and Wikipedia says it "tries to account for myriad irregularities in English of Slavic, Germanic, Celtic, Greek, French, Italian, Spanish, Chinese, and other origin". So I guess it's quite good. But how about names from the Middle East, Pakistan or India? Is DoubleMetaphone a good match also for names from these countries? Are there any better algorithms?

How about fuzzy searches and wildcards, are they impossible to combine? We actually do three queries for every search: one fuzzy, one phonetic and one using ngram. Because I don't have too much confidence in the phonetic algorithm, I would really like to be able to combine fuzzy queries with wildcards.. :)

Regards,
Hågen

On Oct 8, 2012, at 6:09 PM, Erick Erickson wrote:

whether phonetic filters can be multiterm aware: I'd be leery of this, as I basically don't quite know how that would behave. You'd have to ensure that the algorithms changed the first parts of the words uniformly, regardless of what followed. I'm pretty sure that _some_ phonetic algorithms do not follow this pattern, i.e. eric wouldn't necessarily have the same beginning as erickson. That said, some of the algorithms _may_ follow this rule and might be OK candidates for being MultiTermAware.

But you don't need this in order to try it out. See the Expert Level Schema Possibilities at: http://searchhub.org/dev/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/

You can define your own analysis chain for wildcards as part of your fieldType definition and include whatever you want, whether or not it's MultiTermAware, and it will be applied at query time. Use the analyzer type="query" entry as a basis.

_But_ you shouldn't include anything in this section that produces more than one output per input token. Note, token, not field. I.e.
a really bad candidate for this section is WordDelimiterFilterFactory. If you use the admin/analysis page (which you'll get to know intimately) to look at a type that has WordDelimiterFilterFactory in its chain and enter something like erickErickson1234, you'll see what I mean. Make sure to check the verbose box.

If you can determine that some of the phonetic algorithms _should_ be MultiTermAware, please feel free to raise a JIRA and we can discuss... I suspect it'll be on a case-by-case basis.

Best
Erick

On Mon, Oct 8, 2012 at 11:21 AM, Hågen Pihlstrøm Hasle haagenha...@gmail.com wrote:

Hi! I'm quite new to Solr; I was recently asked to help out on a project where the previous Solr-person quit quite suddenly. I've noticed that some of our searches don't return the expected result, and I'm hoping you guys can help me out. We've indexed a lot of names, and would like to search for a person in our system using these names. We previously used Oracle Text for this, and we experience that Solr is