RE: Solr Wildcard Search

2017-11-30 Thread Allison, Timothy B.
A slightly more refined answer...  In my experience with the systems I've 
worked with, Porter and other stemmers can be useful as a "fallback field" with 
a really low boost, but you should be really careful if you're only searching 
on one field.
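
As a rough illustration of the "fallback field" idea, here is a minimal sketch of edismax request parameters that weight an unstemmed field far above a stemmed one. The field names `text_exact` and `text_stemmed` and the boost values are assumptions made up for this example; only `defType`, `q` and `qf` are real Solr parameters.

```python
from urllib.parse import urlencode


def build_layered_query(user_query,
                        exact_field="text_exact",      # hypothetical unstemmed field
                        stemmed_field="text_stemmed",  # hypothetical Porter-stemmed field
                        exact_boost=10.0,
                        stemmed_boost=0.1):
    """Build edismax parameters that prefer exact matches over stemmed ones."""
    return {
        "defType": "edismax",
        "q": user_query,
        # qf lists the searched fields with per-field boosts (field^boost).
        "qf": f"{exact_field}^{exact_boost} {stemmed_field}^{stemmed_boost}",
    }


params = build_layered_query("publication")
# URL-encoded query string that could be appended to a /select request.
print(urlencode(params))
```

The stemmed field still catches documents that only contain a variant form, but its low boost should keep those matches below documents that contain the exact term.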

Cannot recommend Doug Turnbull and John Berryman's "Relevant Search" enough on 
how to layer fields...among many other great insights: 
https://www.manning.com/books/relevant-search


 -Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Thursday, November 30, 2017 9:20 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Wildcard Search

At the very least the English possessive filter, which you have.  Great!

Depending on what your query log analysis finds -- perhaps users are pretty 
much only searching on nouns? -- you might consider 
EnglishMinimalStemFilterFactory.

I wouldn't say that porter was or wasn't chosen intentionally.  It may be good 
for some use cases.  However, for the use cases I've seen, it has been 
disastrous.   

I have code that shows "equivalence sets" for analysis chain A vs analysis 
chain B...with some noise...assume same tokenization...  I should probably 
share that code on github or fold it into Luke somehow?  You can see this on a 
one-off basis in the Solr admin window via the Analysis tab, but to see this on 
your corpus/corpora across terms can be eye-opening, and then to cross-check it 
against query logs...quite powerful.


On one corpus, when I compared the same analysis chain A without Porter and B 
with porter, the output is e.g.:

"stemmed\tunstemmed #docs|unstemmed #docs..."

public  public 9834 | publication 1429 | publications 960 | publicly 662 | 
public's 176 | publicize 118 | publicized 107 | publicity 91 | publically 66 | 
publicizing 63 | publication's 6 | publicizes 4 | public_ 1 | publication_ 1 | 
publiced 1

effect  effective 6329 | effect 3157 | effectively 1745 | effectiveness 1198 | 
effects 831 | effected 139 | effecting 85 | effectives 1

new new 13279 | newness 6 | newed 3 | newe 2 | newing 1

order   order 7256 | orders 3125 | ordered 1840 | ordering 758 | orderly 241 | 
order's 17 | orderable 3 | orders_ 1

Imagine users searching for "publication" (~2500 docs) and getting back every 
document that mentions "public" (~10k).  That's a huge problem in many 
circumstances.  Good luck finding the name "newing".


-Original Message-
From: Georgy Nevsky [mailto:gnevsky.cn...@thomasnet.com]
Sent: Thursday, November 30, 2017 8:31 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Wildcard Search

I understand stemming reason. Thank you.

What do you suggest to use for stemming instead of "Porter" ? I guess, it 
wasn't chosen intentionally.

In the best we trust
Georgy Nevsky


-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, November 30, 2017 8:25 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Wildcard Search

The initial question wasn't about a phrasal search, but I largely agree that 
diff q parsers handle the analysis chain differently for multiterms.

Yes, Porter is crazily aggressive. USE WITH CAUTION!

As has been pointed out, use the Solr admin window and the "debug" in the query 
option to see what's going on.

Use the Solr admin Analysis feature to see how your tokens are being modified 
by each step in the analysis chain.

If you use solr admin and debug the query for "shipping", you see that it is 
stemmed to "ship"...hence all of your matches work.  Porter doesn't have rules 
for words ending in "pp", so it doesn't stem "shipp" to "ship".  So, your 
wildcard query is looking for words that start with "shipp", and given that 
"shipping" was stemmed to "ship", it won't find it.  It would find "shippqrs" 
because porter wouldn't know what to do with that 

Again, Porter can be very dangerous if it doesn't align with user expectations.



-----Original Message-
From: Atita Arora [mailto:atitaar...@gmail.com]
Sent: Thursday, November 30, 2017 8:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Wildcard Search

As Rick raised the most important aspect here , that the phrase is broken into 
multiple terms ORed together , I believe if the use case requires to perform 
wildcard search on phrases , we would need to store the entire phrase as a 
single term in the index which probably is not happening right now and hence 
are not found when sent across as phrases.
I tried this on my local Solr 7.1 without phrase this works as expected , 
however as soon as I do phrase search it fails for the reason as i mentioned 
above.

Let me know if I can clarify further.

On Thu, Nov 30, 2017 at 6:31 PM, Georgy Nevsky <gnevsky.cn...@thomasnet.com>
wrote:

> I wish to understand if I can do something to get in result 

RE: Solr Wildcard Search

2017-11-30 Thread Allison, Timothy B.
At the very least the English possessive filter, which you have.  Great!

Depending on what your query log analysis finds -- perhaps users are pretty 
much only searching on nouns? -- you might consider 
EnglishMinimalStemFilterFactory.

I wouldn't say that porter was or wasn't chosen intentionally.  It may be good 
for some use cases.  However, for the use cases I've seen, it has been 
disastrous.   

I have code that shows "equivalence sets" for analysis chain A vs analysis 
chain B...with some noise...assume same tokenization...  I should probably 
share that code on github or fold it into Luke somehow?  You can see this on a 
one-off basis in the Solr admin window via the Analysis tab, but to see this on 
your corpus/corpora across terms can be eye-opening, and then to cross-check it 
against query logs...quite powerful.


On one corpus, when I compared the same analysis chain A without Porter and B 
with porter, the output is e.g.:

"stemmed\tunstemmed #docs|unstemmed #docs..."

public  public 9834 | publication 1429 | publications 960 | publicly 662 | 
public's 176 | publicize 118 | publicized 107 | publicity 91 | publically 66 | 
publicizing 63 | publication's 6 | publicizes 4 | public_ 1 | publication_ 1 | 
publiced 1

effect  effective 6329 | effect 3157 | effectively 1745 | effectiveness 1198 | 
effects 831 | effected 139 | effecting 85 | effectives 1

new new 13279 | newness 6 | newed 3 | newe 2 | newing 1

order   order 7256 | orders 3125 | ordered 1840 | ordering 758 | orderly 241 | 
order's 17 | orderable 3 | orders_ 1

Imagine users searching for "publication" (~2500 docs) and getting back every 
document that mentions "public" (~10k).  That's a huge problem in many 
circumstances.  Good luck finding the name "newing".
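
For anyone who wants to try this on their own term statistics, the grouping step might look roughly like the sketch below. It is not the code mentioned above: `toy_stem` is a made-up suffix stripper standing in for a real stemmer (it is NOT Porter), and the document counts are the ones quoted in the output above.

```python
from collections import defaultdict


def toy_stem(term):
    """Toy suffix stripper used only for illustration (NOT the Porter algorithm)."""
    for suffix in ("ations", "ation", "ly", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def equivalence_sets(doc_freqs, stem=toy_stem):
    """Group unstemmed terms by their stemmed form, keeping document counts."""
    groups = defaultdict(list)
    for term, n_docs in doc_freqs.items():
        groups[stem(term)].append((term, n_docs))
    # Keep only stems that conflate more than one surface form,
    # with the most frequent form first.
    return {
        stemmed: sorted(members, key=lambda m: -m[1])
        for stemmed, members in groups.items()
        if len(members) > 1
    }


def format_line(stemmed, members):
    """Render one set as 'stemmed<TAB>unstemmed #docs | unstemmed #docs ...'."""
    return stemmed + "\t" + " | ".join(f"{term} {n}" for term, n in members)


doc_freqs = {"public": 9834, "publication": 1429, "publications": 960,
             "publicly": 662, "new": 13279}
sets = equivalence_sets(doc_freqs)
for stemmed, members in sets.items():
    print(format_line(stemmed, members))
```

In practice the `stem` argument would be replaced by a call into the real analysis chain, and `doc_freqs` would come from the index's term dictionary.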


-Original Message-
From: Georgy Nevsky [mailto:gnevsky.cn...@thomasnet.com] 
Sent: Thursday, November 30, 2017 8:31 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Wildcard Search

I understand stemming reason. Thank you.

What do you suggest to use for stemming instead of "Porter" ? I guess, it 
wasn't chosen intentionally.

In the best we trust
Georgy Nevsky


-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, November 30, 2017 8:25 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Wildcard Search

The initial question wasn't about a phrasal search, but I largely agree that 
diff q parsers handle the analysis chain differently for multiterms.

Yes, Porter is crazily aggressive. USE WITH CAUTION!

As has been pointed out, use the Solr admin window and the "debug" in the query 
option to see what's going on.

Use the Solr admin Analysis feature to see how your tokens are being modified 
by each step in the analysis chain.

If you use solr admin and debug the query for "shipping", you see that it is 
stemmed to "ship"...hence all of your matches work.  Porter doesn't have rules 
for words ending in "pp", so it doesn't stem "shipp" to "ship".  So, your 
wildcard query is looking for words that start with "shipp", and given that 
"shipping" was stemmed to "ship", it won't find it.  It would find "shippqrs" 
because porter wouldn't know what to do with that 

Again, Porter can be very dangerous if it doesn't align with user expectations.



-----Original Message-
From: Atita Arora [mailto:atitaar...@gmail.com]
Sent: Thursday, November 30, 2017 8:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Wildcard Search

As Rick raised the most important aspect here , that the phrase is broken into 
multiple terms ORed together , I believe if the use case requires to perform 
wildcard search on phrases , we would need to store the entire phrase as a 
single term in the index which probably is not happening right now and hence 
are not found when sent across as phrases.
I tried this on my local Solr 7.1 without phrase this works as expected , 
however as soon as I do phrase search it fails for the reason as i mentioned 
above.

Let me know if I can clarify further.

On Thu, Nov 30, 2017 at 6:31 PM, Georgy Nevsky <gnevsky.cn...@thomasnet.com>
wrote:

> I wish to understand if I can do something to get in result term 
> "shipping"
> when search for "shipp*"?
>
> Here field definition:
>  multiValued="false"/>
>
>  positionIncrementGap="100">
>   
> 
>  ignoreCase="true"
> words="lang/stopwords_en.txt"
> />
> 
> 
>  protected="protwords.txt"/>
> 
>   
>
> Anything else can be important? Most configuration parameters are 
> default to Apache Solr 7.1.0.
>
> In the best we trust
> Georgy Nevsky
>
>

RE: Solr Wildcard Search

2017-11-30 Thread Georgy Nevsky
I understand now that stemming is the reason. Thank you.

What would you suggest using for stemming instead of "Porter"? I guess it
wasn't chosen intentionally.

In the best we trust
Georgy Nevsky


-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, November 30, 2017 8:25 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Wildcard Search

The initial question wasn't about a phrasal search, but I largely agree that
diff q parsers handle the analysis chain differently for multiterms.

Yes, Porter is crazily aggressive. USE WITH CAUTION!

As has been pointed out, use the Solr admin window and the "debug" in the
query option to see what's going on.

Use the Solr admin Analysis feature to see how your tokens are being
modified by each step in the analysis chain.

If you use solr admin and debug the query for "shipping", you see that it is
stemmed to "ship"...hence all of your matches work.  Porter doesn't have
rules for words ending in "pp", so it doesn't stem "shipp" to "ship".  So,
your wildcard query is looking for words that start with "shipp", and given
that "shipping" was stemmed to "ship", it won't find it.  It would find
"shippqrs" because porter wouldn't know what to do with that 

Again, Porter can be very dangerous if it doesn't align with user
expectations.



-Original Message-
From: Atita Arora [mailto:atitaar...@gmail.com]
Sent: Thursday, November 30, 2017 8:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Wildcard Search

As Rick raised the most important aspect here , that the phrase is broken
into multiple terms ORed together , I believe if the use case requires to
perform wildcard search on phrases , we would need to store the entire
phrase as a single term in the index which probably is not happening right
now and hence are not found when sent across as phrases.
I tried this on my local Solr 7.1 without phrase this works as expected ,
however as soon as I do phrase search it fails for the reason as i mentioned
above.

Let me know if I can clarify further.

On Thu, Nov 30, 2017 at 6:31 PM, Georgy Nevsky <gnevsky.cn...@thomasnet.com>
wrote:

> I wish to understand if I can do something to get in result term
> "shipping"
> when search for "shipp*"?
>
> Here field definition:
>  multiValued="false"/>
>
>  positionIncrementGap="100">
>   
> 
>  ignoreCase="true"
> words="lang/stopwords_en.txt"
> />
> 
> 
>  protected="protwords.txt"/>
> 
>   
>
> Anything else can be important? Most configuration parameters are
> default to Apache Solr 7.1.0.
>
> In the best we trust
> Georgy Nevsky
>
>
> -Original Message-
> From: Rick Leir [mailto:rl...@leirtech.com]
> Sent: Thursday, November 30, 2017 7:32 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Wildcard Search
>
> George,
> When you get those results it could be due to stemming.
>
> Wildcard processing expands your term to multiple terms, OR'd
> together. It also takes you down a different analysis pathway, as many
> analysis components do not work with multiple terms. Look into the
> SolrAdmin console, and use the analysis tab to understand what is
> going on.
>
> If you still have doubts, tell us more about your config.
> Cheers --Rick
>
>
> On November 30, 2017 7:06:42 AM EST, Georgy Nevsky
> <gnevsky.cn...@thomasnet.com> wrote:
> >Can somebody help me understand how Solr Wildcard Search is working?
> >
> >If I’m doing search for “ship*” term I’m getting in result many
> >strings, like “Shipping Weight”, “Ship From”, “Shipping Calculator”,
> >etc.
> >
> >But if I’m searching for “shipp*” I don’t get any result.
> >
> >
> >
> >In the best we trust
> >
> >Georgy Nevsky
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com
>


RE: Solr Wildcard Search

2017-11-30 Thread Allison, Timothy B.
The initial question wasn't about a phrasal search, but I largely agree that 
different query parsers handle the analysis chain differently for multiterm 
queries.

Yes, Porter is crazily aggressive. USE WITH CAUTION!  

As has been pointed out, use the Solr admin window and the "debug" in the query 
option to see what's going on.

Use the Solr admin Analysis feature to see how your tokens are being modified 
by each step in the analysis chain.

If you use solr admin and debug the query for "shipping", you see that it is 
stemmed to "ship"...hence all of your matches work.  Porter doesn't have rules 
for words ending in "pp", so it doesn't stem "shipp" to "ship".  So, your 
wildcard query is looking for words that start with "shipp", and given that 
"shipping" was stemmed to "ship", it won't find it.  It would find "shippqrs" 
because Porter wouldn't know what to do with that.

Again, Porter can be very dangerous if it doesn't align with user expectations.
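
The mismatch can be reproduced in a few lines. This is only a sketch: `toy_stem` is a made-up stand-in that strips "ing" and collapses a doubled final consonant (it is not the Porter algorithm), and `prefix_search` simply models the fact that a wildcard term is compared to the indexed terms without being stemmed.

```python
def toy_stem(token):
    """Toy stemmer for illustration only: 'shipping' -> 'shipp' -> 'ship'."""
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        # Collapse a doubled final consonant left behind by the suffix removal.
        if len(token) >= 2 and token[-1] == token[-2]:
            token = token[:-1]
    return token


def index_terms(text):
    """Index time: lowercase, split on whitespace, stem every token."""
    return {toy_stem(tok) for tok in text.lower().split()}


def prefix_search(prefix, terms):
    """Query time: the wildcard prefix is matched literally, without stemming."""
    return sorted(t for t in terms if t.startswith(prefix))


terms = index_terms("Shipping Weight Ship From Shipping Calculator shippqrs")

print(prefix_search("ship", terms))   # "ship" is in the index, so this matches
print(prefix_search("shipp", terms))  # only the unstemmed "shippqrs" matches
```

Because "shipping" is stored as "ship", no indexed term starts with "shipp" except tokens the stemmer left alone.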



-Original Message-
From: Atita Arora [mailto:atitaar...@gmail.com] 
Sent: Thursday, November 30, 2017 8:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Wildcard Search

As Rick raised the most important aspect here , that the phrase is broken into 
multiple terms ORed together , I believe if the use case requires to perform 
wildcard search on phrases , we would need to store the entire phrase as a 
single term in the index which probably is not happening right now and hence 
are not found when sent across as phrases.
I tried this on my local Solr 7.1 without phrase this works as expected , 
however as soon as I do phrase search it fails for the reason as i mentioned 
above.

Let me know if I can clarify further.

On Thu, Nov 30, 2017 at 6:31 PM, Georgy Nevsky <gnevsky.cn...@thomasnet.com>
wrote:

> I wish to understand if I can do something to get in result term "shipping"
> when search for "shipp*"?
>
> Here field definition:
>  multiValued="false"/>
>
>  positionIncrementGap="100">
>   
> 
>  ignoreCase="true"
> words="lang/stopwords_en.txt"
> />
> 
> 
>  protected="protwords.txt"/>
> 
>   
>
> Anything else can be important? Most configuration parameters are 
> default to Apache Solr 7.1.0.
>
> In the best we trust
> Georgy Nevsky
>
>
> -Original Message-
> From: Rick Leir [mailto:rl...@leirtech.com]
> Sent: Thursday, November 30, 2017 7:32 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Wildcard Search
>
> George,
> When you get those results it could be due to stemming.
>
> Wildcard processing expands your term to multiple terms, OR'd 
> together. It also takes you down a different analysis pathway, as many 
> analysis components do not work with multiple terms. Look into the 
> SolrAdmin console, and use the analysis tab to understand what is 
> going on.
>
> If you still have doubts, tell us more about your config.
> Cheers --Rick
>
>
> On November 30, 2017 7:06:42 AM EST, Georgy Nevsky 
> <gnevsky.cn...@thomasnet.com> wrote:
> >Can somebody help me understand how Solr Wildcard Search is working?
> >
> >If I’m doing search for “ship*” term I’m getting in result many 
> >strings, like “Shipping Weight”, “Ship From”, “Shipping Calculator”, 
> >etc.
> >
> >But if I’m searching for “shipp*” I don’t get any result.
> >
> >
> >
> >In the best we trust
> >
> >Georgy Nevsky
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com
>


Re: Solr Wildcard Search

2017-11-30 Thread Atita Arora
As Rick pointed out, the most important aspect here is that the phrase is
broken into multiple terms ORed together.
I believe that if the use case requires wildcard search on phrases, we would
need to store the entire phrase as a single term in the index, which is
probably not happening right now, and hence the phrases are not found when
sent as phrase queries.
I tried this on my local Solr 7.1: without a phrase it works as expected,
but as soon as I do a phrase search it fails for the reason I mentioned
above.

Let me know if I can clarify further.

On Thu, Nov 30, 2017 at 6:31 PM, Georgy Nevsky <gnevsky.cn...@thomasnet.com>
wrote:

> I wish to understand if I can do something to get in result term "shipping"
> when search for "shipp*"?
>
> Here field definition:
>  multiValued="false"/>
>
>  positionIncrementGap="100">
>   
> 
>  ignoreCase="true"
> words="lang/stopwords_en.txt"
> />
> 
> 
>  protected="protwords.txt"/>
> 
>   
>
> Anything else can be important? Most configuration parameters are default
> to
> Apache Solr 7.1.0.
>
> In the best we trust
> Georgy Nevsky
>
>
> -----Original Message-
> From: Rick Leir [mailto:rl...@leirtech.com]
> Sent: Thursday, November 30, 2017 7:32 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Wildcard Search
>
> George,
> When you get those results it could be due to stemming.
>
> Wildcard processing expands your term to multiple terms, OR'd together. It
> also takes you down a different analysis pathway, as many analysis
> components do not work with multiple terms. Look into the SolrAdmin
> console,
> and use the analysis tab to understand what is going on.
>
> If you still have doubts, tell us more about your config.
> Cheers --Rick
>
>
> On November 30, 2017 7:06:42 AM EST, Georgy Nevsky
> <gnevsky.cn...@thomasnet.com> wrote:
> >Can somebody help me understand how Solr Wildcard Search is working?
> >
> >If I’m doing search for “ship*” term I’m getting in result many
> >strings, like “Shipping Weight”, “Ship From”, “Shipping Calculator”,
> >etc.
> >
> >But if I’m searching for “shipp*” I don’t get any result.
> >
> >
> >
> >In the best we trust
> >
> >Georgy Nevsky
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com
>


RE: Solr Wildcard Search

2017-11-30 Thread Georgy Nevsky
Is there something I can do to get the term "shipping" in the results when I
search for "shipp*"?

Here is the field definition:



  






  

Could anything else be important? Most configuration parameters are the
Apache Solr 7.1.0 defaults.

In the best we trust
Georgy Nevsky


-Original Message-
From: Rick Leir [mailto:rl...@leirtech.com]
Sent: Thursday, November 30, 2017 7:32 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Wildcard Search

George,
When you get those results it could be due to stemming.

Wildcard processing expands your term to multiple terms, OR'd together. It
also takes you down a different analysis pathway, as many analysis
components do not work with multiple terms. Look into the SolrAdmin console,
and use the analysis tab to understand what is going on.

If you still have doubts, tell us more about your config.
Cheers --Rick


On November 30, 2017 7:06:42 AM EST, Georgy Nevsky
<gnevsky.cn...@thomasnet.com> wrote:
>Can somebody help me understand how Solr Wildcard Search is working?
>
>If I’m doing search for “ship*” term I’m getting in result many
>strings, like “Shipping Weight”, “Ship From”, “Shipping Calculator”,
>etc.
>
>But if I’m searching for “shipp*” I don’t get any result.
>
>
>
>In the best we trust
>
>Georgy Nevsky

--
Sorry for being brief. Alternate email is rickleir at yahoo dot com


Re: Solr Wildcard Search

2017-11-30 Thread Rick Leir
George,
When you get those results it could be due to stemming.

Wildcard processing expands your term to multiple terms, OR'd together. It also 
takes you down a different analysis pathway, as many analysis components do not 
work with multiple terms. Look into the SolrAdmin console, and use the analysis 
tab to understand what is going on.

If you still have doubts, tell us more about your config.
Cheers --Rick


On November 30, 2017 7:06:42 AM EST, Georgy Nevsky 
<gnevsky.cn...@thomasnet.com> wrote:
>Can somebody help me understand how Solr Wildcard Search is working?
>
>If I’m doing search for “ship*” term I’m getting in result many
>strings,
>like “Shipping Weight”, “Ship From”, “Shipping Calculator”, etc.
>
>But if I’m searching for “shipp*” I don’t get any result.
>
>
>
>In the best we trust
>
>Georgy Nevsky

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Solr Wildcard Search

2017-11-30 Thread Georgy Nevsky
Can somebody help me understand how Solr wildcard search works?

If I search for the term “ship*”, I get many strings in the results,
like “Shipping Weight”, “Ship From”, “Shipping Calculator”, etc.

But if I search for “shipp*”, I don’t get any results.



In the best we trust

Georgy Nevsky


Solr Wildcard Search for large amount of text

2015-06-27 Thread octopus
Hi, I'm looking at Solr's features for wildcard search used for a large
amount of text. I read on the net that solr.EdgeNGramFilterFactory is used
to generate tokens for wildcard searching. 

For Nigerian = ni, nig, nige, niger, nigeri, nigeria, nigerian

However, I have a large amount of text out there which requires wildcard
search and it's not viable to use EdgeNGramFilterFactory as the amount of
processing will be too huge. Do you have any suggestions/advice please?

Thank you so much for your time! 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Wildcard-Search-for-large-amount-of-text-tp4214392.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Wildcard Search for large amount of text

2015-06-27 Thread Shawn Heisey
On 6/27/2015 4:27 AM, octopus wrote:
> Hi, I'm looking at Solr's features for wildcard search used for a large
> amount of text. I read on the net that solr.EdgeNGramFilterFactory is used
> to generate tokens for wildcard searching.
>
> For Nigerian = ni, nig, nige, niger, nigeri, nigeria, nigerian
>
> However, I have a large amount of text out there which requires wildcard
> search and it's not viable to use EdgeNGramFilterFactory as the amount of
> processing will be too huge. Do you have any suggestions/advice please?

Both edgengrams and wildcards are ways to do this.  There are advantages
and disadvantages to both ways.

To do a wildcard search, Solr (Lucene really) must look up all the
matching terms in the index and substitute them into the query so that
it becomes a large number of simple string matches.  If you have a large
number of terms in your index, that can be slow.  The expensive work
(expanding the terms) is done for every single query.

The edgengram filter does similar work, but it does it at *index* time,
rather than query time.  At query time, you are doing a simple string
match with one term, although the index contains many more terms,
because the very expensive work was done at index time.

It's difficult to know which approach will be more efficient on *your*
index without experimentation, but there is a general rule when it comes
to Solr performance: As much as possible, do the expensive work at index
time.
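
To make the index-time side of that trade-off concrete, here is a small sketch of edge n-gram expansion. The `min_gram` and `max_gram` arguments are meant to mirror the filter's minGramSize and maxGramSize settings, and the two sample documents are invented for the example.

```python
def edge_ngrams(term, min_gram=2, max_gram=15):
    """Return the leading substrings of term, from min_gram up to max_gram chars."""
    upper = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, upper + 1)]


def build_ngram_index(docs, min_gram=2):
    """Index time: store every edge n-gram of every token (the expensive part)."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            for gram in edge_ngrams(token, min_gram):
                index.setdefault(gram, set()).add(doc_id)
    return index


docs = {1: "Nigerian exports", 2: "Niger river"}  # invented sample documents
index = build_ngram_index(docs)

print(edge_ngrams("nigerian"))
# Query time: a prefix search is now a single exact lookup, no term expansion.
hits = index.get("nige", set())
print(hits)
```

The index holds many more terms than the plain tokenized text would, which is the cost paid for the cheap lookup at query time.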

Thanks,
Shawn



Re: Solr Wildcard Search for large amount of text

2015-06-27 Thread Upayavira
That is one way to implement wildcards, but it isn't the most efficient.

Just index normally, tokenized, and search with an asterisk suffix, e.g.
foo*

This will build a finite state automaton that will make wildcard
handling efficient.
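
As an analogy only (Lucene's term dictionary and query execution are implemented quite differently), a trailing-wildcard lookup over a sorted list of terms can be answered with two binary searches instead of a full scan. The term list here is invented for the example.

```python
import sys
from bisect import bisect_left


def prefix_range(sorted_terms, prefix):
    """Return all terms starting with prefix, using binary search on sorted terms."""
    lo = bisect_left(sorted_terms, prefix)
    # Appending the highest code point gives an upper bound that sorts after
    # every realistic term sharing the prefix.
    hi = bisect_left(sorted_terms, prefix + chr(sys.maxunicode))
    return sorted_terms[lo:hi]


terms = sorted(["foo", "fool", "food", "bar", "fop", "football"])
matches = prefix_range(terms, "foo")
print(matches)
```

The point is only that a prefix such as foo* touches a contiguous slice of a sorted term dictionary, which is why trailing wildcards are much cheaper than leading ones.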

Upayavira

On Sat, Jun 27, 2015, at 11:27 AM, octopus wrote:
> Hi, I'm looking at Solr's features for wildcard search used for a large
> amount of text. I read on the net that solr.EdgeNGramFilterFactory is
> used to generate tokens for wildcard searching.
>
> For Nigerian = ni, nig, nige, niger, nigeri, nigeria, nigerian
>
> However, I have a large amount of text out there which requires wildcard
> search and it's not viable to use EdgeNGramFilterFactory as the amount
> of processing will be too huge. Do you have any suggestions/advice please?
>
> Thank you so much for your time!
 
 
 


Re: Solr Wildcard Search for large amount of text

2015-06-27 Thread Erick Erickson
Try it and see ;).

My experience is that wildcards work fine (although what
"fine" means is up to you to decide) _if_ you require
at least two leading real characters, and I actually
prefer three, i.e. ab* or abc*. Note that if you need
leading wildcards, use the reversed wildcard filter
(ReversedWildcardFilterFactory).

I will vociferously argue that single-letter wildcards are
not useful anyway. I mean every single document in your
corpus will probably match every single-letter wildcard
(a*, b*, whatever), providing no benefit to the user.
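
One way to enforce such a rule is to check the query term in the application before it ever reaches Solr. A minimal sketch, where the function names and the default of two characters are just assumptions for illustration:

```python
import re

_WILDCARD = re.compile(r"[*?]")


def leading_literal_length(term):
    """Number of literal characters before the first * or ? in the term."""
    match = _WILDCARD.search(term)
    return len(term) if match is None else match.start()


def is_acceptable_wildcard(term, min_leading=2):
    """Accept plain terms, and wildcard terms with enough leading literals."""
    if _WILDCARD.search(term) is None:
        return True
    return leading_literal_length(term) >= min_leading


for candidate in ("a*", "ab*", "abc*", "*abc", "ship"):
    print(candidate, is_acceptable_wildcard(candidate))
```

Terms that fail the check could be rejected with a message to the user, or rewritten as a non-wildcard query.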

And, the need for wildcards can often be reduced or
eliminated if you can use autosuggest or autocomplete.
Of course, if you're trying to satisfy more complex use
cases where users compose their own complex
clauses, that may not apply.

FWIW,
Erick

On Sat, Jun 27, 2015 at 10:06 AM, Shawn Heisey apa...@elyograg.org wrote:
> On 6/27/2015 4:27 AM, octopus wrote:
> > Hi, I'm looking at Solr's features for wildcard search used for a large
> > amount of text. I read on the net that solr.EdgeNGramFilterFactory is used
> > to generate tokens for wildcard searching.
> >
> > For Nigerian = ni, nig, nige, niger, nigeri, nigeria, nigerian
> >
> > However, I have a large amount of text out there which requires wildcard
> > search and it's not viable to use EdgeNGramFilterFactory as the amount of
> > processing will be too huge. Do you have any suggestions/advice please?
>
> Both edgengrams and wildcards are ways to do this.  There are advantages
> and disadvantages to both ways.
>
> To do a wildcard search, Solr (Lucene really) must look up all the
> matching terms in the index and substitute them into the query so that
> it becomes a large number of simple string matches.  If you have a large
> number of terms in your index, that can be slow.  The expensive work
> (expanding the terms) is done for every single query.
>
> The edgengram filter does similar work, but it does it at *index* time,
> rather than query time.  At query time, you are doing a simple string
> match with one term, although the index contains many more terms,
> because the very expensive work was done at index time.
>
> It's difficult to know which approach will be more efficient on *your*
> index without experimentation, but there is a general rule when it comes
> to Solr performance: As much as possible, do the expensive work at index
> time.
>
> Thanks,
> Shawn



Re: Solr Wildcard Search for large amount of text

2015-06-27 Thread Jack Krupansky
What do you want actual user queries to look like? I mean, having to
explicitly write asterisks after every term is a real pain.

Indexing ngrams has the advantage that phrase queries and edismax phrase
boosting work automatically. Phrases don't work with explicit wildcard
queries.

The only real downside to ngrams is that they explode the size of the
index. But memory is supposed to be cheap these days. I mean, compare the
cost of the extra RAM (to keep the full index in memory) to the cost to
users of their productivity constructing queries and having expensive staff
to help them figure out why various queries don't work as expected.

How big is your corpus - number of documents and average document size?

-- Jack Krupansky

On Sat, Jun 27, 2015 at 6:27 AM, octopus octroll...@gmail.com wrote:

> Hi, I'm looking at Solr's features for wildcard search used for a large
> amount of text. I read on the net that solr.EdgeNGramFilterFactory is used
> to generate tokens for wildcard searching.
>
> For Nigerian = ni, nig, nige, niger, nigeri, nigeria, nigerian
>
> However, I have a large amount of text out there which requires wildcard
> search and it's not viable to use EdgeNGramFilterFactory as the amount of
> processing will be too huge. Do you have any suggestions/advice please?
>
> Thank you so much for your time!






Re: Solr wildcard search

2013-09-14 Thread Erick Erickson
Also be aware that some analysis steps may not
be performed on wildcards. The filter has to be
MultiTermAware. See:

https://wiki.apache.org/solr/MultitermQueryAnalysis
and
http://searchhub.org/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/

Best,
Erick

On Fri, Sep 13, 2013 at 12:12 PM, Jack Krupansky j...@basetechnology.com wrote:

> Wildcard applies only to a single term. The escaped space suggests that
> you are trying to match a wildcard on multiple terms.
>
> Try the contrib complex phrase query parser.
>
> -- Jack Krupansky
>
> -Original Message- From: Prasi S
> Sent: Friday, September 13, 2013 6:37 AM
> To: solr-user@lucene.apache.org
> Subject: Solr wildcard search
>
>
> Hi all,
> I am working with wildcard queries and a few things are confusing.
>
> 1. Does a wildcard search omit the analysers on a particular field?
>
> 2. I have searched for
> q=google\ technology - gives result
> q=google technology - Gives results
> q=google tech*   - gives results
> q=google\ tech* - 0 results. The debug query for the last query is
> <str name="parsedquery_toString">text:google tech*</str>
>
> Why does this happen?
>
>
> Thanks,
> Prasi



Re: Solr wildcard search

2013-09-13 Thread Jack Krupansky
Wildcard applies only to a single term. The escaped space suggests that you 
are trying to match a wildcard on multiple terms.


Try the contrib complex phrase query parser.
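
A simplified model of what happens may help here. This is not Solr's query parser, just a sketch that splits on unescaped whitespace and then compares each wildcard term to whitespace-tokenized index terms; the indexed tokens are assumed for the example.

```python
import re
from fnmatch import fnmatchcase


def split_query_terms(q):
    """Split on whitespace not preceded by a backslash, then unescape spaces."""
    parts = re.split(r"(?<!\\)\s+", q.strip())
    return [p.replace("\\ ", " ") for p in parts]


def wildcard_matches(term, indexed_tokens):
    """Return the indexed tokens that match a single wildcard term."""
    return sorted(t for t in indexed_tokens if fnmatchcase(t, term))


indexed_tokens = {"google", "technology"}  # assumed tokens from a tokenized field

unescaped = split_query_terms(r"google tech*")   # two separate terms
escaped = split_query_terms(r"google\ tech*")    # one term containing a space

print(unescaped, [wildcard_matches(t, indexed_tokens) for t in unescaped])
print(escaped, [wildcard_matches(t, indexed_tokens) for t in escaped])
```

With the escaped space, the single term "google tech*" contains a space, and no token in a tokenized field contains one, so nothing can match.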

-- Jack Krupansky

-Original Message- 
From: Prasi S

Sent: Friday, September 13, 2013 6:37 AM
To: solr-user@lucene.apache.org
Subject: Solr wildcard search

Hi all,
I am working with wildcard queries and few things are confusing.

1. Does a wildcard search omit the analysers on a particular field?

2. I have searched for
q=google\ technology - gives result
q=google technology - Gives results
q=google tech*   - gives results
q=google\ tech* - 0 results. The debug Query for the last query is
<str name="parsedquery_toString">text:google tech*</str>

Why does this happen.


Thanks,
Prasi 



Solr wildcard search

2013-09-13 Thread Prasi S
Hi all,
I am working with wildcard queries and a few things are confusing.

1. Does a wildcard search omit the analysers on a particular field?

2. I have searched for
q=google\ technology - gives result
q=google technology - Gives results
q=google tech*   - gives results
q=google\ tech* - 0 results. The debug query for the last query is
<str name="parsedquery_toString">text:google tech*</str>

Why does this happen?


Thanks,
Prasi