
Re: Index-time synonyms and trailing wildcard issue

2013-02-14 Thread Johannes Rodenwald

Hello Jack,

Thanks for your answer. It helped me gain a deeper understanding of what happens 
at index time and find a solution myself:

It seems that putting the synonym filter in both filter chains (index and 
query), setting expand="false", and putting the desired synonym first in the 
row, does the trick:
Synonyms line (reversed order!):
orange, apfelsine

All documents containing "apfelsine" are now mapped to "orange", so there are 
no more documents containing "apfelsine" that would match a wildcard query for 
"apfel*". ("Apfelsine" is a true synonym for "Orange" in German, meaning 
"Chinese apple". "Apfel" = apple, which shouldn't match oranges.)

Problem solved, thanks again for the help!
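
The effect of expand="false" described above can be modelled in a few lines of Python. This is only a sketch of the behaviour, not Solr code: the three sample documents, the helper names (reduce_term, analyze, build_index, prefix_query, term_query) and the whitespace/lowercase analysis are assumptions made for illustration.

```python
# Toy model of a synonym filter with expand="false": every entry on a
# synonym line is reduced to the FIRST entry of that line.
SYNONYM_LINES = [["orange", "apfelsine"]]  # same order as in the mail above


def reduce_term(term):
    """Map a term to the first entry of its synonym line, if it has one."""
    for line in SYNONYM_LINES:
        if term in line:
            return line[0]
    return term


def analyze(text):
    """Hypothetical analysis chain: whitespace split, lowercase, synonyms."""
    return [reduce_term(token) for token in text.lower().split()]


def build_index(docs):
    """Build a term -> set(doc_id) mapping from the analyzed documents."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def prefix_query(index, prefix):
    """Model a trailing-wildcard query (prefix*) over the indexed terms."""
    hits = set()
    for term, doc_ids in index.items():
        if term.startswith(prefix):
            hits |= doc_ids
    return hits


def term_query(index, word):
    """Model a plain term query, which passes through the same analysis."""
    hits = set()
    for term in analyze(word):
        hits |= index.get(term, set())
    return hits


docs = {1: "Apfelsine", 2: "Orange", 3: "Apfelkuchen"}
index = build_index(docs)
```

In this model prefix_query(index, "apfel") returns only document 3, while term_query(index, "Apfelsine") still finds documents 1 and 2, because "apfelsine" never reaches the index as a term of its own.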

Johannes Rodenwald 

- Original Message -
From: "Jack Krupansky" 
To: solr-user@lucene.apache.org
Sent: Wednesday, 13 February 2013 17:17:40
Subject: Re: Index-time synonyms and trailing wildcard issue

By doing synonyms at index time, you cause "apfelsin" to be added to 
documents that contain only "orang", so of course documents that previously 
only contained "orang" will now match for "apfelsin" or any term query that 
matches "apfelsin", such as a wildcard. At query time, Lucene cannot tell 
whether your original document contained "apfelsin" or if "apfelsin" was 
added when the document was indexed due to an index-time synonym.

Solution: Either disable index time synonyms, or have a parallel field (via 
copyField) that does not have the index-time synonyms.

But... perhaps you should clarify what you really intend to happen with 
these pseudo-synonyms.

-- Jack Krupansky

Re: Index-time synonyms and trailing wildcard issue

2013-02-13 Thread Jack Krupansky

By doing synonyms at index time, you cause "apfelsin" to be added to 
documents that contain only "orang", so of course documents that previously 
only contained "orang" will now match for "apfelsin" or any term query that 
matches "apfelsin", such as a wildcard. At query time, Lucene cannot tell 
whether your original document contained "apfelsin" or if "apfelsin" was 
added when the document was indexed due to an index-time synonym.
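
A toy inverted index makes this concrete. The sketch below is not Lucene code: the stems, document ids and function names are made up for illustration, and a trailing wildcard is modelled as a simple prefix match over the indexed terms.

```python
# Hypothetical synonym group, taken from the stemmed list in this thread.
SYNONYM_GROUPS = [{"apfelsin", "orang"}]


def expand(term):
    """Index-time expansion: return the term plus all of its synonyms."""
    for group in SYNONYM_GROUPS:
        if term in group:
            return set(group)
    return {term}


def build_index(docs):
    """Build a term -> set(doc_id) mapping, adding synonyms at index time."""
    index = {}
    for doc_id, terms in docs.items():
        for term in terms:
            for indexed_term in expand(term):
                index.setdefault(indexed_term, set()).add(doc_id)
    return index


def prefix_query(index, prefix):
    """Model a trailing-wildcard query such as apfel*."""
    hits = set()
    for term, doc_ids in index.items():
        if term.startswith(prefix):
            hits |= doc_ids
    return hits


# Each document is given as its list of (assumed) stemmed terms.
docs = {
    "doc_apfelkuchen": ["apfelkuch"],
    "doc_apfelsine": ["apfelsin"],
    "doc_orange": ["orang"],
}
index = build_index(docs)
hits = prefix_query(index, "apfel")
```

Here hits contains doc_orange as well, because the index entry for "apfelsin" lists it, and the index keeps no record of which term was in the original text.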


Solution: Either disable index time synonyms, or have a parallel field (via 
copyField) that does not have the index-time synonyms.
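
As a rough illustration of the parallel-field idea: in Solr the second field would be filled via copyField, but in this sketch the two "fields" are just two dictionaries. All names (index_document, search, the field keys "plain" and "syn") are hypothetical, and routing wildcard queries to the plain field is an assumption about how such a setup might be queried.

```python
# Hypothetical synonym table built from the stemmed list in this thread.
SYNONYMS = {
    "apfelsin": {"apfelsin", "orang"},
    "orang": {"apfelsin", "orang"},
}


def index_document(index, doc_id, terms):
    """Index each term twice: once as-is, once with index-time synonyms."""
    for term in terms:
        index["plain"].setdefault(term, set()).add(doc_id)
        for synonym in SYNONYMS.get(term, {term}):
            index["syn"].setdefault(synonym, set()).add(doc_id)


def search(index, query):
    """Send wildcard queries to the plain field, term queries to the other."""
    if query.endswith("*"):
        prefix = query[:-1]
        matching = (
            doc_ids
            for term, doc_ids in index["plain"].items()
            if term.startswith(prefix)
        )
        return set().union(*matching)
    return set(index["syn"].get(query, set()))


index = {"plain": {}, "syn": {}}
index_document(index, "d1", ["apfelkuch"])
index_document(index, "d2", ["apfelsin"])
index_document(index, "d3", ["orang"])
```

With this split, search(index, "apfel*") no longer returns d3, while search(index, "apfelsin") still returns both d2 and d3.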


But... perhaps you should clarify what you really intend to happen with 
these pseudo-synonyms.


-- Jack Krupansky

-Original Message- 
From: Johannes Rodenwald

Sent: Wednesday, February 13, 2013 10:25 AM
To: solr-user@lucene.apache.org
Subject: Index-time synonyms and trailing wildcard issue

Hi,

I use Solr 3.6.0 with a synonym filter as the last filter at index time, 
using a list of stemmed terms. When I do a wildcard search that matches 
part of an entry in the synonym list, Solr uses the synonyms it finds to 
generate the search results. I am trying to disable that behaviour, but 
without success.


Example:

Stemmed synonyms:
apfelsin, orang

Search term:
apfel*

Matches:
Apfelkuchen, Apfelsaft, Apfelsine... (good, I want these matches)
Orange (bad, I don't want this match)

My questions are:
- Why does the synonym filter react to a wildcard query? It is not a 
multiterm-aware component (see 
http://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/MultiTermAwareComponent.html).
- How can I disable this behaviour, so that "Orange" is no longer returned 
by the query for "apfel*"?


Regards,

Johannes

Index-time synonyms and trailing wildcard issue

2013-02-13 Thread Johannes Rodenwald

Hi,

I use Solr 3.6.0 with a synonym filter as the last filter at index time, using 
a list of stemmed terms. When I do a wildcard search that matches part of an 
entry in the synonym list, Solr uses the synonyms it finds to generate the 
search results. I am trying to disable that behaviour, but without success.

Example:

Stemmed synonyms: 
apfelsin, orang

Search term:
apfel*

Matches:
Apfelkuchen, Apfelsaft, Apfelsine... (good, I want these matches)
Orange (bad, I don't want this match)

My questions are:
- Why does the synonym filter react to a wildcard query? It is not a 
multiterm-aware component (see 
http://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/MultiTermAwareComponent.html).
- How can I disable this behaviour, so that "Orange" is no longer returned by 
the query for "apfel*"?

Regards,

Johannes

Re: Wildcard ? issue?

2012-02-09 Thread Erick Erickson

You can pull down 3.5 (aka 3.x) from the nightly build if you want, see:
https://builds.apache.org//view/S-Z/view/Solr/job/Solr-3.x/
the "last successful artifacts" link will probably be what you want.

Best
Erick

On Thu, Feb 9, 2012 at 5:35 AM, Dalius Sidlauskas
 wrote:
> Okay, I get it, 3.6 is not released yet. Thanks for the help, fellas!
>
> Regards!
> Dalius Sidlauskas
>
>
>
> On 09/02/12 10:19, Dalius Sidlauskas wrote:
>>
>> It seems this applies to Solr 3.6 and 4.0. My version is 3.5.
>>
>> Regards!
>> Dalius Sidlauskas
>>
>>
>> On 08/02/12 17:26, Ahmet Arslan wrote:
>>>>
>>>> I have already tried this and it did not help, because it does not
>>>> highlight matches if a wildcard is used. The field configuration
>>>> turns the data into:
>>>
>>> This writeup should explain your scenario :
>>> http://wiki.apache.org/solr/MultitermQueryAnalysis

Re: Wildcard ? issue?

2012-02-09 Thread Dalius Sidlauskas


Okay, I get it, 3.6 is not released yet. Thanks for the help, fellas!

Regards!
Dalius Sidlauskas


On 09/02/12 10:19, Dalius Sidlauskas wrote:

It seems this applies to Solr 3.6 and 4.0. My version is 3.5.

Regards!
Dalius Sidlauskas


On 08/02/12 17:26, Ahmet Arslan wrote:

I have already tried this and it did not help, because it does not
highlight matches if a wildcard is used. The field configuration turns
the data into:

This writeup should explain your scenario :
http://wiki.apache.org/solr/MultitermQueryAnalysis

Re: Wildcard ? issue?

2012-02-09 Thread Dalius Sidlauskas


It seems this applies to Solr 3.6 and 4.0. My version is 3.5.

Regards!
Dalius Sidlauskas


On 08/02/12 17:26, Ahmet Arslan wrote:

I have already tried this and it did not help, because it does not
highlight matches if a wildcard is used. The field configuration turns
the data into:

This writeup should explain your scenario :
http://wiki.apache.org/solr/MultitermQueryAnalysis

Re: Wildcard ? issue?

2012-02-08 Thread Ahmet Arslan

> I have already tried this and it did not help, because it does not
> highlight matches if a wildcard is used. The field configuration turns
> the data into:

This writeup should explain your scenario :
http://wiki.apache.org/solr/MultitermQueryAnalysis
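
As far as I understand that page, before Solr 3.6 a wildcard term is not passed through the query analyzer, so it is compared against the already-folded index terms as typed. The Python sketch below models just that for a single field. It is a simplification under stated assumptions: fold() is a rough stand-in for ICUFoldingFilterFactory (accent stripping, removal of the middle dot, lowercasing), fnmatchcase stands in for Lucene's wildcard matching, and it is not meant to reproduce the edismax results reported in this thread.

```python
import unicodedata
from fnmatch import fnmatchcase

MIDDLE_DOT = "\u00b7"


def fold(text):
    """Rough stand-in for ICU folding: strip accents and the middle dot."""
    decomposed = unicodedata.normalize("NFKD", text)
    kept = [
        ch
        for ch in decomposed
        if unicodedata.category(ch) != "Mn" and ch != MIDDLE_DOT
    ]
    return "".join(kept).lower()


# The index analyzer folds the stored value, as in the analysis output
# quoted in this thread (cal·lígraf becomes calligraf).
indexed_term = fold("cal·lígraf")

# A wildcard query term as typed by the user.
raw_pattern = "cal·ligra?"

# Without analysis of the wildcard term, the pattern still has the dot.
unanalyzed_hit = fnmatchcase(indexed_term, raw_pattern)

# With the same folding applied to the pattern, the term matches.
analyzed_hit = fnmatchcase(indexed_term, fold(raw_pattern))
```

In this model unanalyzed_hit is False and analyzed_hit is True, which is the kind of difference multiterm-aware analysis in 3.6 and 4.0 is meant to remove.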

Re: Wildcard ? issue?

2012-02-08 Thread Dalius Sidlauskas

I have already tried this and it did not help, because it does not 
highlight matches if a wildcard is used. The field configuration turns 
the data into:


dc_title: calligraf
dc_title_unicode: cal·lígraf
dc_title_unicode_full: cal·lígraf

Debug parsedquery says:

[Search for *cal·ligraf*]

+DisjunctionMaxQuery((dc_title:*calligraf* |  
dc_title_unicode:cal·ligraf^2.0 | dc_title_unicode_full:cal·ligraf^2.0))


[Search for *cal·ligra?*]

+DisjunctionMaxQuery((dc_title:*cal·ligra?* | 
dc_title_unicode:cal·ligra?^2.0 | dc_title_unicode_full:cal·ligra?^2.0))


Why is the *dc_title* field handled differently? The analysis looks fine:


 Index Analyzer

   org.apache.solr.analysis.HTMLStripCharFilterFactory
   {luceneMatchVersion=LUCENE_34}

   text        cal·lígraf

   org.apache.solr.analysis.PatternReplaceCharFilterFactory
   {replacement=, pattern=-, maxBlockChars=1,
   luceneMatchVersion=LUCENE_34, blockDelimiters=}

   text        cal·lígraf

   org.apache.solr.analysis.WhitespaceTokenizerFactory
   {luceneMatchVersion=LUCENE_34}

   position    1
   term text   cal·lígraf
   startOffset 43
   endOffset   53

   org.apache.solr.analysis.ICUFoldingFilterFactory
   {luceneMatchVersion=LUCENE_34}

   position    1
   term text   calligraf
   startOffset 43
   endOffset   53


 Query Analyzer

   org.apache.solr.analysis.WhitespaceTokenizerFactory
   {luceneMatchVersion=LUCENE_34}

   position    1
   term text   cal·ligra?
   startOffset 0
   endOffset   10

   org.apache.solr.analysis.ICUFoldingFilterFactory
   {luceneMatchVersion=LUCENE_34}

   position    1
   term text   calligra?
   startOffset 0
   endOffset   10


Is this a Solr or Lucene bug?

Regards!
Dalius Sidlauskas



Re: Wildcard ? issue?

2012-02-08 Thread Sethi, Parampreet

Hi Dalius,

If you have not already tried it, check http://localhost:8983/solr/admin/analysis.jsp
(enable verbose output for both the index and the query field values)
for your queries to see which filters and tokenizers are being applied.

Hope it helps!

-param


Re: Wildcard ? issue?

2012-02-08 Thread Dalius Sidlauskas

If you cannot read this mail easily, check this ticket: 
https://issues.apache.org/jira/browse/SOLR-3106 (this mail is a copy of it).


Regards!
Dalius Sidlauskas



Wildcard ? issue?

2012-02-08 Thread Dalius Sidlauskas


Sorry for inaccurate title.

I have three fields (dc_title, dc_title_unicode, dc_title_unicode_full) 
containing the same value:


http://www.tei-c.org/ns/1.0";>cal.lígraf

and these fields are configured accordingly:


  



  
  


  



  


  
  

  



  


  
  

  


And finally my search configuration:


 
   all
   edismax
   2<-25%
   dc_title_unicode_full^2 dc_title_unicode^2 
dc_title
   10
   true
   false
   1
 

  spellcheck



I am trying to match the field with various search phrases (all of which 
are valid). Here are the results:



#   search phrase   match?  Comment
1   cal.lígra?      yes
2   cal.ligra?      no      Changed í to i
3   cal.ligraf      yes
4   calligra?       no


The problem is attempt #2, which fails to match the data. Attempt #3 
works when ? is replaced with f.


One more thing: if * is used instead of ?, other data such as 
cal.lígrafia is matched, but not cal.lígraf...


Also, I have spotted a logic mismatch in the debug parsedQuery field:

*cal·lígraf:* +DisjunctionMaxQuery((dc_title:*calligraf*^2.0 | 
dc_title_unicode:cal·lígraf^3.0 | dc_title_unicode_full:cal·lígraf^3.0))
*cal·lígra?:* +DisjunctionMaxQuery((dc_title:*cal·lígra?*^2.0 | 
dc_title_unicode:cal·lígra?^3.0 | dc_title_unicode_full:cal·lígra?^3.0))


Should the second be "*calligra?*" instead?

Environment:
Tomcat 7.0.25 (request encoding UTF-8)
Solr 3.5.0
Java 7 Oracle
Ubuntu 11.10

--
Regards!
Dalius Sidlauskas
