Re: Wildcards and fuzzy/phonetic query

2012-12-11 Thread Ahmet Arslan
 Lowercasing actually seems to work with Wildcard queries,
 but not with fuzzy queries.  Are there any reasons why
 I should experience such a difference?

Hi Haagen,

Yonik added this recently. https://issues.apache.org/jira/browse/SOLR-4076



Re: Wildcards and fuzzy/phonetic query

2012-12-11 Thread Haagen Hasle

Thank you!  I actually tried to look through Jira, but I didn't focus on the 
minor issues.  For me, this is quite critical.. :-)

Any chance of merging this into the 4.0.1 release?


Regards, Haagen




Re: Wildcards and fuzzy/phonetic query

2012-12-10 Thread Haagen Hasle

It's been two months since I asked about wildcards and phonetic filters, and 
finally the task of upgrading Solr to version 4.0 was prioritized in our 
project.  So the last couple of days I've been working on it.  Another team 
member upgraded Solr from 3.4 to 4.0, and I've been making changes to 
schema.xml to accommodate the new multiterm functionality.

However, it doesn't seem to work.  Lowercasing is still not applied when I do a
fuzzy search, neither through the regular index analyzer and its support for
MultiTermAwareComponents, nor when I define a dedicated multiterm
analyzer.

Do I have to do anything special to enable the multiterm functionality in Solr 
4.0?


Regards, 

Hågen




Re: Wildcards and fuzzy/phonetic query

2012-12-10 Thread Haagen Hasle

Lowercasing actually seems to work with Wildcard queries, but not with fuzzy 
queries.  Are there any reasons why I should experience such a difference?


Regards, Haagen





Re: Wildcards and fuzzy/phonetic query

2012-10-09 Thread Haagen Hasle

I used the admin/analysis page (great tip, I had never used it before - thank 
you!) and it seems to me that the DoubleMetaphone filter converts Hågen to 
both JN and KN.  Will that crash the Solr analysis if I try to include this 
filter in the multiterm-analysis?

Do you know where I can find out more about combining wildcard and fuzzy in the 
same query?  When you say you don't think it is possible, do you mean it is not 
implemented in Solr today, or it can't be implemented because it is technically 
impossible or functionally doesn't make sense? :)  

I wrote in an answer to Otis that I'd like to try to combine fuzzy with Ngram 
as well.  Do you know if that is possible and makes any sense?


Thanks to everyone for quick and good answers, I really appreciate it!


Regards, Hågen

On 8 Oct 2012, at 21:35, Erick Erickson wrote:

 To answer your first question, yes, you've got it right. If you define
 a multiterm section in your fieldType, whatever you put in that section
 gets applied whether the underlying class is MultiTermAware or not.
 Which means you can shoot yourself in the foot really bad <G>...
 
 (…)
 
 Fuzzy searches + wildcards. I don't think you can do that reasonably, but
 I'm not entirely sure.
 
 Best
 Erick



Re: Wildcards and fuzzy/phonetic query

2012-10-09 Thread Jan Høydahl
Hi,

Also be sure to check out the new BeiderMorse phonetic filter:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.BeiderMorseFilterFactory
which handles Middle Eastern and Eastern European names very well.

Phonetic algorithms use tons of rules for how to transform an input depending
on what comes before and after, so I don't believe you'll get wildcards to work
perfectly combined with phonetics, since Solr cannot guess what should come
next. But you may get it to work for many cases; the best approach is simply to
try it out. Use EdgeNgram followed by some phonetic filter and try.
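
To make the EdgeNgram-plus-phonetic idea concrete, here is a minimal Python sketch. It is not Solr code: toy_phonetic is a made-up stand-in for a real phonetic filter (it only rewrites "ch" to "k" and "ph" to "f"), and the gram sizes are arbitrary assumptions.

```python
def edge_ngrams(token, min_gram=2, max_gram=6):
    """Return the leading substrings of a token, as an edge n-gram filter would."""
    token = token.lower()
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def toy_phonetic(text):
    """Hypothetical, deliberately crude phonetic normalisation (assumption)."""
    return text.replace("ch", "k").replace("ph", "f")


def index_terms(name):
    """Index-time chain: edge n-grams first, then the phonetic step on each gram."""
    return {toy_phonetic(gram) for gram in edge_ngrams(name)}


def matches(prefix, name):
    """Query-time: normalise the typed prefix and look for it among the grams."""
    return toy_phonetic(prefix.lower()) in index_terms(name)


for name in ["christian", "kristian", "hansen"]:
    print(name, matches("chr", name))
```

With this toy normaliser the prefix chr reaches both christian and kristian without a wildcard. A real phonetic algorithm looks at more context than this, so some prefixes would still not line up, which is the caveat raised above.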

You may also be interested in a MeetUp talk held in Oslo last month: 
http://www.meetup.com/Oslo-Solr-Community/events/67253692/ You'll find the link 
to Mats' talk about Norwegian phonetics if you scroll down that page.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com




Re: Wildcards and fuzzy/phonetic query

2012-10-09 Thread Erick Erickson
It won't crash Solr if you include it, but it probably won't do what
you expect either due to how wildcards are expanded.

And it gets worse. DoubleMetaphone tries to reduce what it
analyzes, well, phonetically with close letters (or multiple
choices). Some phonetic filters change to fixed 4 letter
combinations as I remember. Some hash to a completely different
string. Some

About combining fuzzy and wildcard. I haven't thought it through,
but it strikes me as fraught with unexpected results. Consider
har* and treating it as a fuzzy match. How would you calculate
the fuzziness of hardiness and harp? Would you consider
her a fuzzy match? How about farther? or even father?
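
One way to see the problem is to compute plain edit distances from the prefix har to the candidate words. The sketch below is a textbook Levenshtein distance in Python, used only for illustration; it makes no claim about how Lucene's FuzzyQuery is implemented.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


for word in ["harp", "her", "hardiness", "farther", "father"]:
    print(word, levenshtein("har", word))
```

The prefix har is one edit away from both harp and her, but six edits away from hardiness. A fuzzy threshold loose enough to admit the wildcard match hardiness would therefore admit a great many unrelated words as well.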

You might be able to do something interesting with EdgeNgram
here, but it still seems like it's going to either explode
computationally or produce results that don't really mean much.
But I'm mostly speculating here.

Frankly, though, I'd do what Jan suggests. Try it out and see if
it's good enough. Especially pin down the use cases. Often
requirements like this are specified by someone who, when
presented with the results of what you can do easily, decides the
effort could best be spent somewhere else.

This whole approach will only increase the number of
documents that are found as the result of a search without
necessarily increasing the relevance of the top N docs on the
first page. Users rarely go to the second page, and often don't
even look past the first few results. Doing wildcard AND fuzzy
queries would likely result in something useful only a very small
percentage of the time. But that's just a guess.

Best
Erick





Wildcards and fuzzy/phonetic query

2012-10-08 Thread Hågen Pihlstrøm Hasle
Hi!

I'm quite new to Solr; I was recently asked to help out on a project where the
previous Solr person quit quite suddenly.  I've noticed that some of our
searches don't return the expected results, and I'm hoping you guys can help me
out.

We've indexed a lot of names, and would like to search for a person in our
system using these names.  We previously used Oracle Text for this, and we
find that Solr is much faster.  So far so good! :)  But when we try to
use wildcards, things start to go wrong.

We're using Solr 3.4, and I see that some of our problems are solved in 3.6.  
Ref SOLR-2438:
https://issues.apache.org/jira/browse/SOLR-2438

But we would also like to be able to combine wildcards with fuzzy searches, and 
wildcards with a phonetic filter.  I don't see anything about phonetic filters 
in SOLR-2438 or SOLR-2921.  (https://issues.apache.org/jira/browse/SOLR-2921)  
Is it possible to make the phonetic filters MultiTermAware?

Regarding fuzzy queries, in Oracle Text I can search for chr% (chr* in
Solr) and find both christian and kristian.  As far as I understand, this is
not possible in Solr, since WildcardQuery and FuzzyQuery cannot be combined.  Is
this correct, or have I misunderstood something?  Are there any workarounds or
filter combinations I can use to achieve the same result?  I've seen people
suggest using a boolean query to combine the two, but I don't really see how
that would solve my chr* problem.

As I mentioned earlier I'm quite new to this, so I apologize if what I'm asking
about only shows my ignorance.


Regards, Hågen

Re: Wildcards and fuzzy/phonetic query

2012-10-08 Thread Jack Krupansky
A regular expression term may provide what you want, but not exactly. Maybe 
something like:


/(ch|k)r.*/

(No guarantee that will actually work.)

See:
http://lucene.apache.org/core/4_0_0-BETA/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Regexp_Searches

And probably slower than desirable.
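
The pattern itself can be checked outside Solr. The sketch below uses Python's re module, whose syntax is not identical to Lucene's regexp syntax, so treat it only as a check of what this simple pattern accepts. The name list is made up, and fullmatch is used on the assumption that the regexp has to match the whole indexed term.

```python
import re

# Same pattern as above, without the surrounding slashes of the query syntax.
pattern = re.compile(r"(ch|k)r.*")

# Hypothetical lowercased index terms, for illustration only.
names = ["christian", "kristian", "karl", "chris", "hansen"]

# Keep only the terms that the pattern matches in full.
matched = [name for name in names if pattern.fullmatch(name)]
print(matched)
```

Every alternative spelling has to be written into the pattern by hand, so this covers the ch/k case but does not generalise to substitutions nobody thought of in advance.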

-- Jack Krupansky




Re: Wildcards and fuzzy/phonetic query

2012-10-08 Thread Erick Erickson
Regarding whether phonetic filters can be multiterm aware:

I'd be leery of this, as I basically don't quite know how that would
behave. You'd have to ensure that the algorithms changed the
first parts of the words uniformly, regardless of what followed. I'm
pretty sure that _some_ phonetic algorithms do not follow this
pattern, e.g. eric wouldn't necessarily have the same beginning
as erickson. That said, some of the algorithms _may_ follow this
rule and might be OK candidates for being MultiTermAware.
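
As an illustration of the prefix problem, the sketch below implements classic American Soundex in Python. Soundex is chosen here only because it is short and well known; that choice is an assumption, and no Solr encoder is used. It handles plain ASCII letters and non-empty input only.

```python
# Letter-to-digit table for American Soundex; vowels, h, w and y have no digit.
GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def soundex(word):
    """Return the 4-character Soundex code of a non-empty ASCII word."""
    word = word.lower()
    code = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate letters with equal digits
            prev = digit
    return (code + "000")[:4]  # pad with zeros or truncate to 4 characters


print(soundex("eric"), soundex("erickson"))
```

Here eric encodes to E620 and erickson to E625, so the code for the prefix is not a prefix of the code for the longer name, and a wildcard built from the encoded prefix would miss it.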

But you don't need this in order to try it out. See the "Expert Level
Schema Possibilities" section at:
http://searchhub.org/dev/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/

You can define your own analysis chain for wildcards as part of your
fieldType definition and include whatever you want, whether or not it's
MultiTermAware, and it will be applied at query time. Use the
<analyzer type="query"> entry as a basis. _But_ you shouldn't include
anything in this section that produces more than one output per input
token. Note, token, not field. A really bad candidate for this section
is WordDelimiterFilterFactory. If you use the admin/analysis page (which
you'll get to know intimately), look at a type that has
WordDelimiterFilterFactory in its chain and put in something like
erickErickson1234, you'll see what I mean. Make sure to check the
verbose box.
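
The "more than one output per input token" problem can be imitated in a few lines of Python. The regular expression below is a rough approximation that splits on case changes and letter/digit boundaries; it is an assumption made for illustration and does not reproduce WordDelimiterFilterFactory or its options.

```python
import re


def rough_word_delimiter_split(token):
    """Split a token on case changes and letter/digit boundaries (approximation)."""
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+|\d+", token)


parts = rough_word_delimiter_split("erickErickson1234")
print(parts)
```

One input token becomes three, which is why such a filter is a poor fit for a chain that has to turn a single wildcard term into a single term.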

If you can determine that some of the phonetic algorithms _should_ be
MultiTermAware, please feel free to raise a JIRA and we can discuss... I suspect
it'll be on a case-by-case basis.

Best
Erick



Re: Wildcards and fuzzy/phonetic query

2012-10-08 Thread Otis Gospodnetic
Hi,

Consider looking into synonyms and ngrams.

Otis
--
Performance Monitoring - http://sematext.com/spm


Re: Wildcards and fuzzy/phonetic query

2012-10-08 Thread Hågen Pihlstrøm Hasle

I guess synonyms would give me a result similar to using regexes, as Jack
suggested.

I've thought about that, but I don't think it would be good enough.
Substituting k for ch is easy enough, but the problem is that I have to
think of every possible substitution in advance.  I'd like Fil* to find
Phillip, I'd like Hen* to find Hansen, and so on.  The possibilities are
quite endless, and I can't think of them all.  I can't limit myself to
Norwegian names either; a lot of people living in Norway have names from other
countries.  I'd like Moha* to find Mouhammed, etc.  Or am I too
pessimistic?

I haven't read enough about ngrams yet, so I'm not sure I've understood them
properly.  Do they divide the word into several pieces and try to find one or
more matches?  Would that really help in my Chr* example?  I guess you mean
the combination of synonyms and ngrams?

Is it possible to combine ngrams with a fuzzy query, so that every piece of a
word is matched in a fuzzy way?  Could that help me?
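
For readers unfamiliar with ngrams, the sketch below shows the basic idea in Python: each word is cut into overlapping character trigrams, and two words are compared by the grams they share. The gram size of 3 and the Jaccard overlap score are assumptions picked for the example, not Solr's scoring.

```python
def char_ngrams(word, n=3):
    """Return the set of overlapping character n-grams of a word."""
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a, b):
    """Share of grams the two sets have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b)


query = char_ngrams("Mohammed")
indexed = char_ngrams("Mouhammed")

print(sorted(query & indexed))
print(round(jaccard(query, indexed), 2))
```

The two spellings share four of their trigrams (ham, amm, mme, med), so a match on gram overlap already tolerates the spelling difference to some degree without a fuzzy query on top.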

I'll certainly look into ngrams more, thanks for the suggestion.


Regards, Hågen  




Re: Wildcards and fuzzy/phonetic query

2012-10-08 Thread Hågen Pihlstrøm Hasle

I understand that I'm quickly reaching the boundaries of my Solr competence
when I'm supposed to read about "Expert Level" concepts. :)  I had already
read it once, but now I read it again. Twice.  And I'm not sure I understand
it correctly.  So let me ask a follow-up question:
If I define an analyzer of type multiterm, will every filter I include for that
analyzer be applied, even if it's not MultiTermAware?

To complicate this further, I'm not really sure whether phonetic filters are a
good match for our needs.  We search for names, and these names can come from all
over the world.  We use DoubleMetaphone, and Wikipedia says it "tries to
account for myriad irregularities in English of Slavic, Germanic, Celtic,
Greek, French, Italian, Spanish, Chinese, and other origin".  So I guess it's
quite good.  But how about names from the Middle East, Pakistan or India?  Is
DoubleMetaphone a good match for names from these countries as well?  Are there
any better algorithms?

How about fuzzy-searches and wildcards, are they impossible to combine?

We actually do three queries for every search, one fuzzy, one phonetic and one 
using ngram.  Because I don't have too much confidence in the phonetic 
algorithm, I would really like to be able to combine fuzzy queries with 
wildcards.. :)


Regards, Hågen


On Oct 8, 2012, at 6:09 PM, Erick Erickson wrote:

 whether phonetic filters can be multiterm aware:
 
 I'd be leery of this, as I basically don't quite know how that would
 behave. You'd have to insure that the  algorithms changed the
 first parts of the words uniformly, regardless of what followed. I'm
 pretty sure that _some_ phonetic algorithms do not follow this
 pattern, i.e. eric wouldn't necessarily have the same beginning
 as erickson. That said, some of the algorithms _may_ follow this
 rule and might be OK candidates for being MultiTermAware
 
 But, you don't need this in order to try it out. See the Expert Level
 Schema Possibilities
 at:
 http://searchhub.org/dev/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/
 
 You can define your own analysis chain for wildcards as part of your 
 fieldType
 definition and include whatever you want, whether or not it's
 MultiTermAware and it
 will be applied at query time. Use the analyzer type=query entry
 as a basis. _But_ you shouldn't include anything in this section that
 produces more than one output per input token. Note, token, not
 field. I.e. a really bad candidate for this section is
 WordDelimiterFilterFactory
 if you use the admin/analysis page (which you'll get to know intimately) and
 look at a type that has WordDelimiterFilterFactory in its chain and
 put something
 like erickErickson1234, you'll see what I mean.. Make sure and check the
 verbose box
 
 If you can determine that some of the phonetic algorithms _should_ be
 MultiTermAware, please feel free to raise a JIRA and we can discuss... I 
 suspect
 it'll be on a case-by-case basis.
 
 Best
 Erick
 
 On Mon, Oct 8, 2012 at 11:21 AM, Hågen Pihlstrøm Hasle
 haagenha...@gmail.com wrote:
 Hi!
 
 I'm quite new to Solr, I was recently asked to help out on a project where 
 the previous Solr-person quit quite suddenly.  I've noticed that some of 
 our searches don't return the expected result, and I'm hoping you guys can 
 help me out.
 
 We've indexed a lot of names, and would like to search for a person in our 
 system using these names.  We previously used Oracle Text for this, and we 
 experience that Solr is much faster.  So far so good! :)  But when we try to 
 use wildcards things start to to wrong.
 
 We're using Solr 3.4, and I see that some of our problems are solved in 3.6. 
  Ref SOLR-2438:
 https://issues.apache.org/jira/browse/SOLR-2438
 
 But we would also like to be able to combine wildcards with fuzzy searches, 
 and wildcards with a phonetic filter.  I don't see anything about phonetic 
 filters in SOLR-2438 or SOLR-2921.  
 (https://issues.apache.org/jira/browse/SOLR-2921)
 Is it possible to make the phonetic filters MultiTermAware?
 
 Regarding fuzzy queries, in Oracle Text I can search for chr% (chr* in 
 Solr..) and find both christian and kristian.  As far as I understand, this 
 is not possible in Solr: WildcardQuery and FuzzyQuery cannot be combined.  
 Is this correct, or have I misunderstood anything?  Are there any 
 workarounds or filter-combinations I can use to achieve the same result?  
 I've seen people suggest using a boolean query to combine the two, but I 
 don't really see how that would solve my chr*-problem.
 
 As I mentioned earlier I'm quite new to this, so I apologize if what I'm 
 asking about only shows my ignorance..
 
 
 Regards, Hågen



Re: Wildcards and fuzzy/phonetic query

2012-10-08 Thread Erick Erickson
To answer your first question, yes, you've got it right. If you define
a multiterm section in your fieldType, whatever you put in that section
gets applied whether the underlying class is MultiTermAware or not.
Which means you can shoot yourself in the foot really bad <G>...
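
One way to picture the foot-shooting is a filter that turns one token into several, like the WordDelimiterFilterFactory case mentioned elsewhere in this thread. The following is a rough Python sketch, not Solr code: lowercase, split_on_delimiters and analyze_multiterm are hypothetical names, the splitting rule is only an approximation of what a word-delimiter filter does, and the one-token check is this sketch's own.

```python
import re


def lowercase(tokens):
    """One token in, one token out: a safe kind of step for a wildcard term."""
    return [t.lower() for t in tokens]


def split_on_delimiters(tokens):
    """Rough stand-in for a word-delimiter filter.

    Splits on case changes and letter/digit boundaries, so a single
    input token can come out as several tokens.
    """
    out = []
    for t in tokens:
        out.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+", t))
    return out


def analyze_multiterm(term, chain):
    """Run one wildcard/fuzzy term through every step in chain.

    Every step is applied, whatever it is; the sketch then insists on
    exactly one output token and raises ValueError otherwise.
    """
    tokens = [term]
    for step in chain:
        tokens = step(tokens)
    if len(tokens) != 1:
        raise ValueError(f"{term!r} became {tokens!r}; expected exactly one token")
    return tokens[0]


if __name__ == "__main__":
    print(analyze_multiterm("Chr", [lowercase]))       # chr
    print(split_on_delimiters(["erickErickson1234"]))  # ['erick', 'Erickson', '1234']
```

Passing [split_on_delimiters, lowercase] as the chain for erickErickson1234 raises the ValueError, because there is no single term left to put the wildcard on.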

As for better algorithms: you have 6 or so possibilities out of the box... and
all of them will fail at times. Fuzzy searches will also fail at times. And so
will most anything else you try. The problem is that these are algorithmic in
nature and there are just too many cases that don't fit; human language is
so endlessly variable.

Whether Middle Eastern names will work well with phonetic filters, well,
what's the input language? Are you indexing English (or Norwegian or...)
translations? In that case things should work OK since the phonetic variations
should be accounted for in the translations.

If you're indexing in different languages, you can apply different
phonetic filters on different fields, so you might be able to work it that
way. But if you're indexing multiple languages into a _single_ field, you'll
have a lot of other problems to solve before you start worrying about
phonetics...

All I can really say is give it a try and see how well it works, since good
search results are so domain dependent.

Fuzzy searches + wildcards. I don't think you can do that reasonably, but
I'm not entirely sure.
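
For what it's worth, the boolean-query idea from the original question (chr* OR chr~) can be simulated on a toy term list to see why it does not get from chr to kristian. This is a minimal Python sketch under stated assumptions: TERMS is a made-up index, the fuzzy match is plain Levenshtein distance with an assumed limit of 2 edits, and none of the function names come from Solr or Lucene.

```python
def levenshtein(a, b):
    """Plain edit distance (insert, delete, substitute), computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


# Hypothetical indexed terms, for illustration only.
TERMS = ["christian", "kristian", "christine", "karl"]


def wildcard_prefix(prefix, terms):
    """Simulates prefix*: terms that start with the given prefix."""
    return [t for t in terms if t.startswith(prefix)]


def fuzzy(term, terms, max_edits=2):
    """Simulates term~: whole terms within max_edits of the whole query term."""
    return [t for t in terms if levenshtein(term, t) <= max_edits]


def prefix_or_fuzzy(text, terms, max_edits=2):
    """Boolean OR of the two clauses: text* OR text~."""
    hits = set(wildcard_prefix(text, terms)) | set(fuzzy(text, terms, max_edits))
    return sorted(hits)


if __name__ == "__main__":
    print(prefix_or_fuzzy("chr", TERMS))  # ['christian', 'christine']
    print(fuzzy("christian", TERMS))      # ['christian', 'kristian', 'christine']
```

In this sketch the fuzzy clause compares whole terms, so chr~ is nowhere near kristian, and OR-ing it with chr* only returns what chr* already found. The fuzzy match reaches kristian only once the full name christian is typed.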

Best
Erick

On Mon, Oct 8, 2012 at 2:28 PM, Hågen Pihlstrøm Hasle
haagenha...@gmail.com wrote:

 I understand that I'm quickly reaching the boundaries of my Solr-competence 
 when I'm supposed to read about Expert Level concepts.. :)  I had already 
 read it once, but now I read it again. Twice.  And I'm not sure if I 
 understand it correctly..  So let me ask a follow-up question:
 If I define an analyzer of type multiterm, will every filter I include for 
 that analyzer be applied, even if it's not MultiTermAware?

 To complicate this further, I'm not really sure if phonetic filters are a good 
 match for our needs.  We search for names, and these names can come from all 
 over the world.  We use DoubleMetaphone, and Wikipedia says it "tries to 
 account for myriad irregularities in English of Slavic, Germanic, Celtic, 
 Greek, French, Italian, Spanish, Chinese, and other origin".  So I guess it's 
 quite good.  But how about names from the Middle East, Pakistan or India?  Is 
 DoubleMetaphone a good match also for names from these countries?  Are there 
 any better algorithms?

 How about fuzzy-searches and wildcards, are they impossible to combine?

 We actually do three queries for every search, one fuzzy, one phonetic and 
 one using ngram.  Because I don't have too much confidence in the phonetic 
 algorithm, I would really like to be able to combine fuzzy queries with 
 wildcards.. :)


 Regards, Hågen


 On Oct 8, 2012, at 6:09 PM, Erick Erickson wrote:

 whether phonetic filters can be multiterm aware:

 I'd be leery of this, as I basically don't quite know how that would
 behave. You'd have to ensure that the algorithms changed the
 first parts of the words uniformly, regardless of what followed. I'm
 pretty sure that _some_ phonetic algorithms do not follow this
 pattern, e.g. eric wouldn't necessarily have the same beginning
 as erickson. That said, some of the algorithms _may_ follow this
 rule and might be OK candidates for being MultiTermAware.

 But, you don't need this in order to try it out. See the Expert Level
 Schema Possibilities section at:
 http://searchhub.org/dev/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/

 You can define your own analysis chain for wildcards as part of your
 fieldType definition and include whatever you want, whether or not it's
 MultiTermAware, and it will be applied at query time. Use the
 <analyzer type="query"> entry as a basis. _But_ you shouldn't include
 anything in this section that produces more than one output per input
 token. Note: token, not field. For example, a really bad candidate for
 this section is WordDelimiterFilterFactory. If you use the admin/analysis
 page (which you'll get to know intimately), look at a type that has
 WordDelimiterFilterFactory in its chain, and put in something like
 erickErickson1234, you'll see what I mean. Make sure to check the
 verbose box.

 If you can determine that some of the phonetic algorithms _should_ be
 MultiTermAware, please feel free to raise a JIRA and we can discuss... I
 suspect it'll be on a case-by-case basis.

 Best
 Erick

 On Mon, Oct 8, 2012 at 11:21 AM, Hågen Pihlstrøm Hasle
 haagenha...@gmail.com wrote:
 Hi!

 I'm quite new to Solr, I was recently asked to help out on a project where 
 the previous Solr-person quit quite suddenly.  I've noticed that some of 
 our searches don't return the expected result, and I'm hoping you guys can 
 help me out.

 We've indexed a lot of names, and would like to search for a person in our 
 system using these names.  We previously used Oracle Text for this, and we 
 experience that Solr is