Re: Index-time synonyms and trailing wildcard issue

2013-02-14 Thread Johannes Rodenwald
Hello Jack,

Thanks for your answer. It helped me gain a deeper understanding of what
happens at index time and find a solution myself:

It seems that putting the synonym filter in both filter chains (index and
query), setting expand="false", and putting the desired synonym first on the
line does the trick:
Synonyms line (reversed order!):
orange, apfelsine
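
As a rough illustration of why this works, here is a minimal Python sketch (a
toy model, not Solr code) of a reducing synonym map: with expand="false",
every term on a comma-separated line is mapped to the first term on that
line. The helper names and the three sample documents are made up for the
example, and stemming is left out.

```python
from fnmatch import fnmatchcase


def build_reducing_map(synonym_lines):
    """Map every term on a line to the first term of that line."""
    mapping = {}
    for line in synonym_lines:
        terms = [t.strip() for t in line.split(",") if t.strip()]
        for term in terms:
            mapping[term] = terms[0]
    return mapping


def analyze(tokens, mapping):
    """Apply the same synonym map at index time and at query time."""
    return [mapping.get(tok, tok) for tok in tokens]


MAPPING = build_reducing_map(["orange, apfelsine"])

# Hypothetical documents, already lowercased and tokenized.
INDEX = {
    "doc_apfelsine": analyze(["apfelsine"], MAPPING),
    "doc_orange": analyze(["orange"], MAPPING),
    "doc_apfelkuchen": analyze(["apfelkuchen"], MAPPING),
}


def term_search(term):
    """Term queries are analyzed, so the synonym map applies to them."""
    (analyzed,) = analyze([term], MAPPING)
    return sorted(doc for doc, terms in INDEX.items() if analyzed in terms)


def wildcard_search(pattern):
    """Wildcard queries skip the synonym filter and match indexed terms."""
    return sorted(
        doc for doc, terms in INDEX.items()
        if any(fnmatchcase(t, pattern) for t in terms)
    )


print(wildcard_search("apfel*"))   # ['doc_apfelkuchen']
print(term_search("apfelsine"))    # ['doc_apfelsine', 'doc_orange']
```

In this model "apfel*" matches only the Apfelkuchen document, while a plain
query for "apfelsine" still finds both the Apfelsine and the Orange document.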

All documents containing "apfelsine" are now mapped to "orange", so there are
no more documents containing "apfelsine" that would match a wildcard query for
"apfel*". ("Apfelsine" is a true synonym for "Orange" in German, literally
meaning "Chinese apple". "Apfel" = apple, which shouldn't match oranges.)

Problem solved, thanks again for the help!

Johannes Rodenwald 

- Original Message -
From: "Jack Krupansky"
To: solr-user@lucene.apache.org
Sent: Wednesday, February 13, 2013 17:17:40
Subject: Re: Index-time synonyms and trailing wildcard issue

By doing synonyms at index time, you cause "apfelsin" to be added to 
documents that contain only "orang", so of course documents that previously 
only contained "orang" will now match for "apfelsin" or any term query that 
matches "apfelsin", such as a wildcard. At query time, Lucene cannot tell 
whether your original document contained "apfelsin" or if "apfelsin" was 
added when the document was indexed due to an index-time synonym.

Solution: Either disable index-time synonyms, or have a parallel field (via
copyField) that does not have the index-time synonyms.

But... perhaps you should clarify what you really intend to happen with 
these pseudo-synonyms.

-- Jack Krupansky




Re: Index-time synonyms and trailing wildcard issue

2013-02-13 Thread Jack Krupansky
By doing synonyms at index time, you cause "apfelsin" to be added to 
documents that contain only "orang", so of course documents that previously 
only contained "orang" will now match for "apfelsin" or any term query that 
matches "apfelsin", such as a wildcard. At query time, Lucene cannot tell 
whether your original document contained "apfelsin" or if "apfelsin" was 
added when the document was indexed due to an index-time synonym.
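
A toy Python model of that effect (not Lucene code; the document names and
tokens are invented, and the tokens are assumed to be already stemmed):

```python
from fnmatch import fnmatchcase

SYNONYM_GROUPS = [{"apfelsin", "orang"}]


def index_terms(tokens):
    """Expand each token to its whole synonym group, roughly what an
    expanding index-time synonym filter does."""
    terms = set()
    for tok in tokens:
        terms.add(tok)
        for group in SYNONYM_GROUPS:
            if tok in group:
                terms |= group
    return terms


DOCS = {
    "orange_doc": ["orang"],
    "juice_doc": ["apfelsaft"],
    "pear_doc": ["birn"],
}
INDEX = {name: index_terms(tokens) for name, tokens in DOCS.items()}


def wildcard_hits(pattern):
    """Match the pattern against whatever terms ended up in the index."""
    return sorted(
        name for name, terms in INDEX.items()
        if any(fnmatchcase(t, pattern) for t in terms)
    )


print(sorted(INDEX["orange_doc"]))  # ['apfelsin', 'orang']
print(wildcard_hits("apfel*"))      # ['juice_doc', 'orange_doc']
```

The index no longer records that "apfelsin" was injected into orange_doc, so
the wildcard matches it like any other term.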


Solution: Either disable index-time synonyms, or have a parallel field (via
copyField) that does not have the index-time synonyms.
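
The parallel-field idea can be sketched as a toy model in Python. In Solr this
would be a second field filled via copyField whose analyzer has no synonym
filter; wildcard queries go to that field and ordinary term queries go to the
synonym field. The field names text_syn and text_plain are hypothetical, and
routing queries by type is assumed to happen in the client.

```python
from fnmatch import fnmatchcase

SYNONYM_GROUPS = [{"apfelsin", "orang"}]


def with_synonyms(tokens):
    """Add the full synonym group for any token that belongs to one."""
    terms = set(tokens)
    for group in SYNONYM_GROUPS:
        if terms & group:
            terms |= group
    return terms


def index_document(tokens):
    """Index the same tokens into two fields (hypothetical names)."""
    return {
        "text_syn": with_synonyms(tokens),  # index-time synonyms applied
        "text_plain": set(tokens),          # plain copy, no synonyms
    }


INDEX = {
    "orange_doc": index_document(["orang"]),
    "apfelsine_doc": index_document(["apfelsin"]),
    "juice_doc": index_document(["apfelsaft"]),
}


def search(query):
    """Wildcard queries use the plain field, term queries the synonym field."""
    if "*" in query or "?" in query:
        return sorted(
            name for name, fields in INDEX.items()
            if any(fnmatchcase(t, query) for t in fields["text_plain"])
        )
    return sorted(
        name for name, fields in INDEX.items() if query in fields["text_syn"]
    )


print(search("apfel*"))    # ['apfelsine_doc', 'juice_doc']
print(search("apfelsin"))  # ['apfelsine_doc', 'orange_doc']
```

Unlike the reducing-map approach, documents that really contain "apfelsin"
still match "apfel*" here, at the cost of a second indexed field.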


But... perhaps you should clarify what you really intend to happen with 
these pseudo-synonyms.


-- Jack Krupansky

-Original Message- 
From: Johannes Rodenwald

Sent: Wednesday, February 13, 2013 10:25 AM
To: solr-user@lucene.apache.org
Subject: Index-time synonyms and trailing wildcard issue

Hi,

I use Solr 3.6.0 with a synonym filter as the last filter at index time,
using a list of stemmed terms. When I do a wildcard search that matches
part of an entry on the synonym list, Solr uses the synonyms it finds to
generate the search results. I am trying to disable that behaviour, but with
no success.


Example:

Stemmed synonyms:
apfelsin, orang

Search term:
apfel*

Matches:
Apfelkuchen, Apfelsaft, Apfelsine... (good, I want these matches)
Orange (bad, I don't want this match)

My questions are:
- Why does the synonym filter react to a wildcard query? It is not a
multiterm-aware component (see
http://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/MultiTermAwareComponent.html)
- How can I disable this behaviour, so that "Orange" is no longer returned
by the query for "apfel*"?


Regards,

Johannes 



Index-time synonyms and trailing wildcard issue

2013-02-13 Thread Johannes Rodenwald
Hi,

I use Solr 3.6.0 with a synonym filter as the last filter at index time, using
a list of stemmed terms. When I do a wildcard search that matches part of an
entry on the synonym list, Solr uses the synonyms it finds to generate the
search results. I am trying to disable that behaviour, but with no success.

Example:

Stemmed synonyms: 
apfelsin, orang

Search term:
apfel*

Matches:
Apfelkuchen, Apfelsaft, Apfelsine... (good, I want these matches)
Orange (bad, I don't want this match)

My questions are:
- Why does the synonym filter react to a wildcard query? It is not a
multiterm-aware component (see
http://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/MultiTermAwareComponent.html)
- How can I disable this behaviour, so that "Orange" is no longer returned by
the query for "apfel*"?

Regards,

Johannes


Re: Synonyms and trailing wildcard

2013-01-15 Thread Jack Krupansky
It's certainly true that wildcard suppresses the synonym filter since it is 
not "multi-term aware."


Other than implementing your own version of the synonym filter that is
multi-term aware and interprets wildcards, you may have to write your own
preprocessor.


Or, you could do index-time synonyms, so that "bill", "billy", "will",
"willy", and "william" are all indexed at the same location. Then the bil*
wildcard would match "william", since "bill" is also indexed at the same
location.
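
A small Python sketch of what "indexed at the same location" means: each token
position holds a set of terms, and a wildcard only needs to match one of them.
This is a toy model, not Lucene code. The nickname group is taken from the
example above; the two sample names are invented.

```python
from fnmatch import fnmatchcase

NICKNAMES = {"bill", "billy", "will", "willy", "william"}


def index_name(text):
    """Return one set of terms per token position."""
    positions = []
    for tok in text.lower().split():
        # A nickname is replaced by the whole group at the same position.
        positions.append(set(NICKNAMES) if tok in NICKNAMES else {tok})
    return positions


INDEX = {name: index_name(name) for name in ["William Smith", "Bilbo Jones"]}


def wildcard_hits(pattern):
    """A document matches if any term at any position matches the pattern."""
    return sorted(
        name for name, positions in INDEX.items()
        if any(fnmatchcase(t, pattern) for terms in positions for t in terms)
    )


print(wildcard_hits("bil*"))  # ['Bilbo Jones', 'William Smith']
print(wildcard_hits("wil*"))  # ['William Smith']
```

Note that in this model bil* also returns "William Smith", which differs from
the "no synonym expansion for Bil*" expectation in the original question.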


-- Jack Krupansky

-Original Message- 
From: Roberto Isaac Gonzalez

Sent: Tuesday, January 15, 2013 3:10 PM
To: solr-user@lucene.apache.org
Subject: Synonyms and trailing wildcard

Hi

I'm working on adding nicknames capability to our system. It's basically a
synonym mapping stored in a nicknames.txt file that uses the SynonymFilter
framework.

In one of our search boxes (used for lookups), we automatically append a
trailing wildcard.

There's one use case we're dealing with which is expanding synonyms even if
there's a trailing wildcard.

i.e. Q: Bill*
Expected Results: Bill, Billie, William

Q: Bil*
Expected Results: Bill, so no synonym expansion.

Basically, for synonym expansion, we want to treat the token as if it
didn't contain the trailing wildcard and we also *don't* want to expand the
wildcard before doing the synonym matches.

We tried using the multiterm analysis chain but by definition that expects
one token *in* and one token *out*
(org.apache.solr.schema.TextField.analyzeMultiTerm()), so it throws an
exception.

I'm looking for options about implementing this scenario and some of the
options I've explored are:

1. Use the multiterm analysis chain and allow synonym expansion, so one
token in and multiple tokens out.
2. Iterate ourselves and see if the multiterm analysis chain returns more
than one token; if it does, remove the SynonymFilter from the analysis
chain, similar to ExtendedDismaxQParser.shouldRemoveStopFilter().
3. Use ExtendedDismaxQParser.preProcessUserQuery() to OR in the
non-wildcarded term.
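
As a sketch of option 3, the rewrite could look roughly like this in Python
(for illustration only; in Solr it would be Java inside a custom query
parser). The function name is hypothetical, and only a single token with one
trailing * is handled.

```python
def or_in_plain_term(user_query):
    """Rewrite 'Bill*' to '(Bill* OR Bill)'; leave other queries unchanged."""
    q = user_query.strip()
    prefix = q[:-1]
    is_simple_prefix_query = (
        q.endswith("*")
        and prefix != ""
        and not any(ch in prefix for ch in "*? ")
    )
    if is_simple_prefix_query:
        return f"({q} OR {prefix})"
    return q


print(or_in_plain_term("Bill*"))  # (Bill* OR Bill)
print(or_in_plain_term("Bil*"))   # (Bil* OR Bil)
print(or_in_plain_term("Bill"))   # Bill
```

Assuming query-time synonyms are configured, the plain term would then go
through the normal analysis chain, so "Bill" can pick up its nicknames, while
"Bil" has none and only the wildcard part contributes.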

What do you guys think?


Best Regards,
Roberto Gonzalez