RE: Solr, SQL Server's LIKE

2012-01-04 Thread Devon Baumgarten
Great suggestion! Thanks for keeping it simple for a complete Solr newbie.

I'm going to go try this right now.

Thanks!
Devon Baumgarten





Re: Solr, SQL Server's LIKE

2012-01-02 Thread Chantal Ackermann

Thanks, Erick! That sounds great. I really do have to upgrade.

Chantal






Re: Solr, SQL Server's LIKE

2012-01-02 Thread Shawn Heisey

On 12/29/2011 3:51 PM, Devon Baumgarten wrote:

N-Grams get me pretty great results in general, but I don't want the results 
for this particular search to be fuzzy. How can I prevent the fuzzy matches 
from appearing?

Ex: If I search Albatross I want Albert to be excluded completely, rather 
than having a low score.


To achieve this while using the ngram filter, just do the ngram analysis 
on the index side, but not on the query side.  If you do this, you'll 
likely need a maxGramSize larger than would normally be required (which 
will make the index larger), and you might need to use the LengthFilter too.


Thanks,
Shawn
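
A minimal Python sketch of the suggestion above, not Solr's own implementation: n-grams are produced only while indexing, and the query term is looked up as a single whole token. The function names, the gram sizes (2 to 15), and the sample names are assumptions made for illustration.

```python
def ngrams(token, min_gram, max_gram):
    """Return every substring of token whose length is in [min_gram, max_gram]."""
    grams = set()
    for size in range(min_gram, max_gram + 1):
        for start in range(len(token) - size + 1):
            grams.add(token[start:start + size])
    return grams


def build_index(names, min_gram=2, max_gram=15):
    """Index side: lowercase, split on whitespace, emit n-grams for each token."""
    index = {}
    for name in names:
        for token in name.lower().split():
            for gram in ngrams(token, min_gram, max_gram):
                index.setdefault(gram, set()).add(name)
    return index


def search(index, query):
    """Query side: lowercase only, no n-grams, so the whole term must be indexed."""
    return sorted(index.get(query.lower(), set()))


# Hypothetical sample data.
names = ["Albatross Road", "Albert Hall", "Big Albatross Inc"]
index = build_index(names)
print(search(index, "Albatross"))  # "Albert Hall" is not returned
print(search(index, "batro"))      # matching in the middle of a word still works
```

With max_gram=5 the same query for Albatross returns nothing, because no indexed gram is as long as the query term. That is the reason a larger maxGramSize is needed when the query side is not n-grammed.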



Re: Solr, SQL Server's LIKE

2012-01-01 Thread Erick Erickson
Chantal:

bq: The problem with the wildcard searches is that the input is not
analyzed.

As of 3.6/4.0, this is no longer entirely true. Some analysis is
performed for wildcard searches by default, and you can
specify most anything you want if you really need to. See:
https://issues.apache.org/jira/browse/SOLR-2438
and
http://wiki.apache.org/solr/MultitermQueryAnalysis

Best
Erick
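
A rough Python sketch of the point above, using only the standard library: when the index holds lowercased terms, a wildcard pattern that bypasses analysis can miss, while applying the same lowercasing to the pattern finds the terms. The term list and the function name are assumptions for illustration, and this does not reproduce Solr's actual multiterm analysis chain.

```python
from fnmatch import fnmatchcase

# Terms as they might sit in the index after a lowercase filter (assumed data).
indexed_terms = ["albatross", "albert", "musil"]


def wildcard_search(pattern, analyze_pattern):
    """Match a wildcard pattern against the indexed terms.

    analyze_pattern=False mimics wildcard input that bypasses analysis;
    True applies the same lowercasing that was applied at index time.
    """
    if analyze_pattern:
        pattern = pattern.lower()
    return [term for term in indexed_terms if fnmatchcase(term, pattern)]


print(wildcard_search("Alb*", analyze_pattern=False))  # no hits: case differs
print(wildcard_search("Alb*", analyze_pattern=True))   # albatross and albert
```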



RE: Solr, SQL Server's LIKE

2011-12-30 Thread Chantal Ackermann

The problem with the wildcard searches is that the input is not
analyzed. For English, this might not be such a problem (except if you
expect case-insensitive search). But then again, you don't get that with
like, either. Ngrams bring that and more.

What I think is often forgotten when comparing 'like' and Solr search
is:
Solr's analyzers allow not only for case-insensitive search but also for
other analysis such as removing diacritics, and this is also applied when
sorting (you have to create a separate index in the DB, as well, if you
want that).

Say you have the following names:
'Van Hinden'
'van Hinden'
'Música'
'Musil'

like 'mu%' - no hits
like 'Mu%' - 1 hit
like 'van%' - 1 hit
like 'hin%' - no hits

with Solr whitespace or standard tokenizer, ngrams and a diacritics and
lowercase filter (no wildcard search):
'mu'/'Mu' - 2 hits sorted ignoring case and diacritics
'van' - 2 hits
'hin' - 2 hits


(This is written down from experience. I haven't checked those examples
explicitly.)

Cheers,
Chantal
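
The comparison above can be sketched in plain Python. This is only an approximation: like_prefix assumes a case- and accent-sensitive comparison (which the hit counts in the message imply), and analyzed_search stands in for a whitespace tokenizer, lowercase and diacritics folding, and n-grams by doing a substring test on folded tokens. All function names are hypothetical.

```python
import unicodedata

# "M\u00fasica" is 'Música', written with an escape so the accented
# character is unambiguous.
NAMES = ["Van Hinden", "van Hinden", "M\u00fasica", "Musil"]


def fold(text):
    """Lowercase and strip diacritics, e.g. 'Música' becomes 'musica'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def like_prefix(names, prefix):
    """Case- and accent-sensitive stand-in for LIKE 'prefix%'."""
    return [name for name in names if name.startswith(prefix)]


def analyzed_search(names, query):
    """Fold the query, match it inside any folded whitespace token,
    and sort the hits ignoring case and diacritics."""
    q = fold(query)
    hits = [name for name in names if any(q in fold(tok) for tok in name.split())]
    return sorted(hits, key=fold)


for prefix in ("mu", "Mu", "van", "hin"):
    print("like", prefix, like_prefix(NAMES, prefix))
for query in ("mu", "van", "hin"):
    print("analyzed", query, analyzed_search(NAMES, query))
```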






RE: Solr, SQL Server's LIKE

2011-12-30 Thread Devon Baumgarten
Hoss,

Thanks. You've answered my question. To clarify, what I should have asked for 
instead of 'exact' was 'not fuzzy'. For some reason it didn't occur to me that 
I didn't need n-grams to use the wildcard. You asking for me to clarify what I 
meant made me realize that the n-grams are the source of all my current 
problems. :)

Thanks!

Devon Baumgarten




Re: Solr, SQL Server's LIKE

2011-12-29 Thread Shashi Kant
For a simple, hackish (albeit inefficient) approach, look up wildcard searches,

e.g. foo*, *bar






Solr, SQL Server's LIKE

2011-12-29 Thread Devon Baumgarten
I have been tinkering with Solr for a few weeks, and I am convinced that it 
could be very helpful in many of my upcoming projects. I am trying to decide 
whether Solr is appropriate for this one, and I haven't had luck looking for 
answers on Google.

I need to search a list of names of companies and individuals pretty exactly. 
T-SQL's LIKE operator does this with decent performance, but I have a feeling 
there is a way to configure Solr to do this better. I've tried using an edge 
N-gram tokenizer, but it feels like it might be more complicated than 
necessary. What would you suggest?

I know this sounds kind of 'Golden Hammer,' but there has been talk of other, 
more complicated (magic) searches that I don't think SQL Server can handle, 
since its tokens (as far as I know) can't be smaller than one word.

Thanks,

Devon Baumgarten



Re: Solr, SQL Server's LIKE

2011-12-29 Thread Erick Erickson
SQL's LIKE is usually handled with ngrams if you want
*stuff* kinds of searches. Wildcards are interesting
in Solr.

Things Solr handles that aren't easy in SQL:
phrases, phrases with slop, stemming,
synonyms. And, especially, some kind
of relevance ranking.

But Solr does NOT do the things SQL is best at,
things like joins etc. Each has its sweet spot
and trying to make one do all the functions of the
other is fraught with places to go wrong.

Not a lot of help, but free text searching is what Solr is
all about, so if your problem maps into that space,
it's a great tool!

Best
Erick




RE: Solr, SQL Server's LIKE

2011-12-29 Thread Devon Baumgarten
Erick,

Thanks. I know I'll be able to utilize some of Solr's free text searching 
capabilities in other search types in this project. The product manager wants 
this particular search to exactly mimic LIKE%.

N-Grams get me pretty great results in general, but I don't want the results 
for this particular search to be fuzzy. How can I prevent the fuzzy matches 
from appearing?

Ex: If I search Albatross I want Albert to be excluded completely, rather 
than having a low score.

Devon Baumgarten





Re: Solr, SQL Server's LIKE

2011-12-29 Thread Sujit Pal
Hi Devon,

Have you considered using a permuterm index? It's workable, but depending
on your requirements (size of fields that you want to create the index
on), it may bloat your index. I've written about it here:
http://sujitpal.blogspot.com/2011/10/lucene-wildcard-query-and-permuterm.html 

Another alternative which I've implemented is a custom mechanism that
retrieves a list of matching unique ids from a database table using a
SQL LIKE, then passes this list as a filter to the main query. It's
hacky, but I was building a custom handler anyway, so it was quite
simple to add in.

-sujit
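
For readers unfamiliar with the term, here is a small Python sketch of the general permuterm idea. It is not the code from the linked blog post. Every rotation of term + "$" is indexed, and a pattern with exactly one "*" is rotated so that the wildcard ends up last, which turns it into a prefix lookup. The function names are hypothetical, and the linear scan stands in for a lookup in a sorted term dictionary.

```python
def rotations(term):
    """All rotations of term + '$', where '$' marks the end of the term."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]


def build_permuterm(terms):
    """Map every rotation back to the term it came from."""
    index = {}
    for term in terms:
        for rot in rotations(term):
            index[rot] = term
    return index


def wildcard_lookup(index, pattern):
    """Resolve a pattern containing exactly one '*' as a prefix lookup.

    'head*tail' is rotated to 'tail$head', so the wildcard sits at the end.
    A real index would seek into sorted rotations instead of scanning them all.
    """
    head, _, tail = pattern.partition("*")
    prefix = tail + "$" + head
    return sorted({term for rot, term in index.items() if rot.startswith(prefix)})


index = build_permuterm(["albatross", "albert", "cross"])
print(wildcard_lookup(index, "alb*"))   # trailing wildcard
print(wildcard_lookup(index, "*ross"))  # leading wildcard
print(wildcard_lookup(index, "a*s"))    # wildcard in the middle
```

Each term of length n contributes n + 1 rotations, which is where the index bloat mentioned above comes from.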




RE: Solr, SQL Server's LIKE

2011-12-29 Thread Chris Hostetter

: Thanks. I know I'll be able to utilize some of Solr's free text 
: searching capabilities in other search types in this project. The 
: product manager wants this particular search to exactly mimic LIKE%.
...
: Ex: If I search Albatross I want Albert to be excluded completely, 
: rather than having a low score.

please be specific about the types of queries you want. ie: we need more 
than one example of the type of input you want to provide, the type of 
matches you want to see for that input, and the type of matches you want 
to get back.

in your first message you said you need to match company titles pretty 
exactly but then seem to contradict yourself by saying the SQL's LIKE 
command fits the bill -- even though the SQL LIKE command exists 
specifically for inexact matches on field values.

Based on your one example above of Albatross, you don't need anything 
special: don't use ngrams, don't use stemming, don't use fuzzy anything -- 
just search for Albatross and it will match Albatross but not 
Albert.  if you want Albatross to match Albatross Road use some 
basic tokenization.

If all you really care about is prefix searching (which seems suggested by 
your LIKE% comment above, which i'm guessing is shorthand for something 
similar to LIKE 'ABC%'), so that queries like abc and abcd both 
match abcdef and abcd but neither of them match abcd 
then just use prefix queries (ie: abcd*) -- they should be plenty 
efficient for your purposes.  you only need to worry about ngrams when you 
want to efficiently match in the middle of a string. (ie: TITLE LIKE 
%ABC%)


-Hoss
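
A rough Python illustration of why prefix matching is cheap while matching in the middle of a string is not, assuming the terms are kept in a sorted list. The helper names and sample terms are made up for this sketch, and Lucene's real term dictionary is more sophisticated than a Python list.

```python
from bisect import bisect_left


def prefix_matches(sorted_terms, prefix):
    """Binary-search to the first candidate, then read the contiguous run
    of terms that share the prefix."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


def infix_matches(sorted_terms, fragment):
    """LIKE '%fragment%' has no such shortcut: every term must be inspected."""
    return [term for term in sorted_terms if fragment in term]


# Hypothetical sample terms.
terms = sorted(["abcd", "abcdef", "albatross", "albert", "xabcd"])
print(prefix_matches(terms, "abc"))        # abcd, abcdef
print(prefix_matches(terms, "albatross"))  # albatross only, not albert
print(infix_matches(terms, "abcd"))        # also finds xabcd
```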