[jira] [Commented] (SOLR-11968) Multi-words query time synonyms

2018-02-23 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374145#comment-16374145
 ] 

Jim Ferenczi commented on SOLR-11968:
-

bq. Jim Ferenczi, sorry, based on feedback from Robert over on LUCENE-4065 (see 
my comments there), I no longer see how your "twd/the walking dead" example 
represents graph corruption. Can you say more about why you call it corruption?

[~steve_rowe] what I mean by corruption is that we need to "infer" the holes 
when we build the graph from the token stream. For some cases we are able to 
reconstruct the correct graph (see how TokenStreamToAutomaton handles dead 
states) but there are cases where it is not possible. Here is another 
example:

{code:java|title=TestStopFilterFactory.java}
public void testLeadingStopwordSynonymGraph() throws Exception {
  SynonymMap.Builder builder = new SynonymMap.Builder(true);
  // Words of a multi-word synonym are separated by the NUL character
  // (SynonymMap.WORD_SEPARATOR).
  builder.add(new CharsRef("twd"), new CharsRef("the\u0000walking\u0000dead"), true);
  builder.add(new CharsRef("twd"), new CharsRef("the\u0000man"), false);
  final SynonymMap synonymMap = builder.build();

  Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      MockTokenizer tokenizer = new MockTokenizer();
      TokenStream stream = new SynonymGraphFilter(tokenizer, synonymMap, true);
      // The stop filter runs after synonym expansion and removes "the".
      stream = new StopFilter(stream, CharArraySet.copy(Collections.singleton("the")));
      return new TokenStreamComponents(tokenizer, stream);
    }
  };

  TokenStream tokenStream = analyzer.tokenStream("field", "twd");
  assertTokenStreamContents(tokenStream,
      new String[] { "twd", "walking", "dead", "man" },
      null, null,
      new int[] { 1, 1, 1, 1 }, // posinc
      new int[] { 4, 1, 2, 1 }, // poslen
      null);
}
{code}

In this case "walking" and "man" appear on the same path, so the graph contains 
"twd", "walking dead" and "walking man".
The token stream is not corrupted, but the graph is wrong and I don't see how we 
can "fix" it outside of the stop filter.
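
To make the "same path" claim concrete, here is a minimal, Lucene-free sketch that turns the four (posinc, poslen) pairs from the test above into arcs and enumerates the paths a graph consumer would see. The class `TokenGraphPaths` and its hole rule (any start node that nothing arrives at gets connected to the node just before it) are assumptions made for illustration; this is not the actual TokenStreamToAutomaton algorithm.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenGraphPaths {

  /** A token as a consumer sees it: term, position increment, position length. */
  static final class Token {
    final String term;
    final int posInc;
    final int posLen;

    Token(String term, int posInc, int posLen) {
      this.term = term;
      this.posInc = posInc;
      this.posLen = posLen;
    }
  }

  /** An arc between two graph nodes; a null term marks an inferred hole. */
  static final class Arc {
    final int from;
    final int to;
    final String term;

    Arc(int from, int to, String term) {
      this.from = from;
      this.to = to;
      this.term = term;
    }
  }

  /** Returns every path from node 0 to the last node, with holes left out of the text. */
  static List<String> paths(List<Token> tokens) {
    List<Arc> arcs = new ArrayList<>();
    int pos = -1;
    int last = 0;
    for (Token t : tokens) {
      pos += t.posInc;                        // start node of this token
      arcs.add(new Arc(pos, pos + t.posLen, t.term));
      last = Math.max(last, pos + t.posLen);
    }

    // Hypothetical hole inference: a start node with no arriving arc is
    // linked to the previous node, since the removed token left no trace.
    List<Arc> holes = new ArrayList<>();
    for (Arc a : arcs) {
      if (a.from > 0 && !hasArrival(arcs, a.from) && !hasArrival(holes, a.from)) {
        holes.add(new Arc(a.from - 1, a.from, null));
      }
    }
    arcs.addAll(holes);

    List<String> out = new ArrayList<>();
    walk(0, last, arcs, "", out);
    return out;
  }

  private static boolean hasArrival(List<Arc> arcs, int node) {
    for (Arc a : arcs) {
      if (a.to == node) {
        return true;
      }
    }
    return false;
  }

  private static void walk(int node, int last, List<Arc> arcs, String prefix, List<String> out) {
    if (node == last) {
      out.add(prefix.trim());
      return;
    }
    for (Arc a : arcs) {
      if (a.from == node) {
        walk(a.to, last, arcs, a.term == null ? prefix : prefix + " " + a.term, out);
      }
    }
  }

  public static void main(String[] args) {
    List<Token> tokens = List.of(
        new Token("twd", 1, 4),
        new Token("walking", 1, 1),
        new Token("dead", 1, 2),
        new Token("man", 1, 1));
    paths(tokens).forEach(System.out::println);
  }
}
```

Under this rule the hole inferred in front of "man" starts where "walking" ends, which is how "walking man" becomes a path even though no synonym rule produces it.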

> Multi-words query time synonyms
> ---
>
> Key: SOLR-11968
> URL: https://issues.apache.org/jira/browse/SOLR-11968
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers, Schema and Analysis
>Affects Versions: master (8.0), 6.6.2
> Environment: Centos 7.x
>Reporter: Dominique Béjean
>Assignee: Steve Rowe
>Priority: Major
>
> I am trying multi words query time synonyms with Solr 6.6.2 and the 
> SynonymGraphFilterFactory filter, as explained in this article
>  
> [https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/]
>   
>  My field type is :
> {code:java}
> 
>      
>        
>                      articles="lang/contractions_fr.txt"/>
>        
>        
>         ignoreCase="true"/>
>        
>      
>      
>        
>                      articles="lang/contractions_fr.txt"/>
>        
>                      ignoreCase="true" expand="true"/>
>        
>         ignoreCase="true"/>
>        
>      
>    {code}
>  
>  synonyms.txt contains the line :
> {code:java}
> om, olympique de marseille{code}
>  
>  stopwords.txt contains the word 
> {code:java}
> de{code}
>  
>  The order of words in my query has an impact on the generated query in 
> edismax
> {code:java}
> q={!edismax qf='name_text_gp' v=$qq}
>  =false
>  =...{code}
> with "qq=om maillot" or "qq=olympique de marseille maillot", I can see the 
> synonyms expansion. It is working as expected.
> {code:java}
> "parsedquery_toString":"+(((+name_text_gp:olympiqu +name_text_gp:marseil 
> +name_text_gp:maillot) name_text_gp:om))",
>  "parsedquery_toString":"+((name_text_gp:om (+name_text_gp:olympiqu 
> +name_text_gp:marseil +name_text_gp:maillot)))",{code}
> with "qq=maillot om" or "qq=maillot olympique de marseille", I can see the 
> same generated query 
> {code:java}
> "parsedquery_toString":"+((name_text_gp:maillot) (name_text_gp:om))",
>  "parsedquery_toString":"+((name_text_gp:maillot) (name_text_gp:om))",{code}
> I don't understand these generated queries. The first one looks like the 
> synonym expansion is ignored, but the second one shows it is not ignored and 
> only the synonym term is used.
>   
>  When I test the analysis for the field type, the synonyms are correctly 
> expanded for both expressions
> {code:java}
> om maillot  
>  maillot om
>  olympique de marseille maillot
>  maillot olympique de marseille{code}
> resulting outputs always include the following terms (obviously not always 
> in the same order)
> {code:java}
> olympiqu om marseil maillot {code}
>  
>  So, I suspect an issue with the edismax query parser.



--
This message was sent by 

[jira] [Commented] (SOLR-11968) Multi-words query time synonyms

2018-02-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373936#comment-16373936
 ] 

Robert Muir commented on SOLR-11968:


Also the stupid gap stuff acts differently depending on language: how should 
exact phrases match singular<->plural in English but not Farsi? And they won't 
match the definite article "the" in English but will in Bulgarian because that's 
a suffix there. Totally crazy even for non-graph cases :)




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11968) Multi-words query time synonyms

2018-02-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373928#comment-16373928
 ] 

Robert Muir commented on SOLR-11968:


{quote}
AFAICT Robert is suggesting a StopFilter mode that would optionally remove 
gaps. IOW its current behavior would remain (and be the default).
{quote}

I'm not sure whether it should be the default; first we should see if we can 
even make it work so we can test it out. 

Maybe "leaving a hole/gap" as we do today is actually what is wrong, and just 
doesn't make sense at all now that positionLength is at play? Honestly it was 
kind of strange to begin with, e.g. that stopword removal has no impact on 
phrase queries. For example it's definitely not what Google seems to do with 
phrase queries, try "walk plank". 

This definitely relates to the whole reason that I opened LUCENE-4065 in the 
first place: there was too much conflated into one configuration option: the 
strange "gap" stuff mixed together with "don't move synonyms to entirely 
different words", all combined into one boolean.




[jira] [Commented] (SOLR-11968) Multi-words query time synonyms

2018-02-22 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373775#comment-16373775
 ] 

Steve Rowe commented on SOLR-11968:
---

[~jim.ferenczi], sorry, based on feedback from Robert over on LUCENE-4065 (see 
my comments there), I no longer see how your "twd/the walking dead" example 
represents graph corruption.  Can you say more about why you call it corruption?




[jira] [Commented] (SOLR-11968) Multi-words query time synonyms

2018-02-22 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373710#comment-16373710
 ] 

Steve Rowe commented on SOLR-11968:
---

Thanks Jim, I didn't realize that StopFilter (and other FilteringTokenFilters, 
I assume) can still produce bad token streams.  I added a test showing this, 
based on your example, to LUCENE-4065.

bq. There are other cases where it is not possible to "fix" the graph produced 
by the token stream which is why I said that a stop filter that would remove 
gaps is IMO the best solution

Do you have examples of these other cases?  Maybe put them on LUCENE-4065?




[jira] [Commented] (SOLR-11968) Multi-words query time synonyms

2018-02-22 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373554#comment-16373554
 ] 

Jim Ferenczi commented on SOLR-11968:
-

bq. I think you're wrong, [~jim.ferenczi].

Well, it depends on how you see the problem. I agree that the gap could be 
inferred when we build the graph; I have a patch that does that, but there are 
some cases where we just can't. For instance, take the following synonym rule:

`twd, the walking dead` creates a broken token stream if you set a stop word 
filter that removes "the" after the synonym filter:

|| ||twd||walking||dead||
|posinc|1|1|1|
|poslen|3|1|1|

The gap produced by "the" is not propagated to the posInc of "walking" because 
the stop word appears on a token with a posInc equal to 0. There are other 
cases where it is not possible to "fix" the graph produced by the token stream, 
which is why I said that a stop filter that would remove gaps is IMO the best 
solution.
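
The lost gap can be reproduced without Lucene. The sketch below mimics the way a filtering token filter carries the position increments of removed tokens over to the next kept token. The class `StopGapSketch` and both input streams are assumptions chosen to match the tables quoted in this thread, not the exact output of SynonymGraphFilter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class StopGapSketch {

  /** Minimal token model: term, position increment, position length. */
  static final class Tok {
    final String term;
    final int posInc;
    final int posLen;

    Tok(String term, int posInc, int posLen) {
      this.term = term;
      this.posInc = posInc;
      this.posLen = posLen;
    }

    @Override
    public String toString() {
      return term + "(" + posInc + "," + posLen + ")";
    }
  }

  /**
   * Drops stop words and adds the increments of the dropped tokens to the
   * next kept token. Position lengths are never touched.
   */
  static List<Tok> removeStopWords(List<Tok> in, Set<String> stopWords) {
    List<Tok> out = new ArrayList<>();
    int skipped = 0;
    for (Tok t : in) {
      if (stopWords.contains(t.term)) {
        skipped += t.posInc; // a removed token with posInc 0 adds nothing
        continue;
      }
      out.add(new Tok(t.term, t.posInc + skipped, t.posLen));
      skipped = 0;
    }
    return out;
  }

  public static void main(String[] args) {
    // Assumed stream for "twd" with the rule `twd, the walking dead`.
    List<Tok> twd = List.of(
        new Tok("twd", 1, 3), new Tok("the", 0, 1),
        new Tok("walking", 1, 1), new Tok("dead", 1, 1));
    System.out.println(removeStopWords(twd, Set.of("the")));

    // Assumed stream for "olimpique de marseille" with synonym "om".
    List<Tok> om = List.of(
        new Tok("olimpique", 1, 1), new Tok("om", 0, 3),
        new Tok("de", 1, 1), new Tok("marseille", 1, 1));
    System.out.println(removeStopWords(om, Set.of("de")));
  }
}
```

In the first stream "the" contributes an increment of 0, so "walking" keeps a posInc of 1 and nothing in the output records that a token used to precede it. In the second stream "de" has a posInc of 1, so "marseille" ends up with a posInc of 2 and the gap stays visible.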

bq. AFAICT Robert is suggesting a StopFilter *mode* that would *optionally* 
remove gaps. IOW its current behavior would remain (and be the default).

Yes, I know that it would be an optional mode, but at least it would allow 
removing stop words inside multi-word synonyms.




[jira] [Commented] (SOLR-11968) Multi-words query time synonyms

2018-02-22 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373123#comment-16373123
 ] 

Steve Rowe commented on SOLR-11968:
---

bq. The posLen of olimpique is 1 but marseille has a posInc of 2. This means 
that there is a hole between olimpique and marseille but posLen doesn't 
indicate this hole.

I think you're wrong, [~jim.ferenczi].

bq. olimpique points to a state that doesn't exist

Aha, this is the crux, I assume: the "state that doesn't exist" isn't actually 
represented by these two attributes, it has to be inferred.  IMHO the 
brokenness here is inability to handle gaps, not in token filters that produce 
them.

posLen (on olimpique) doesn't have to indicate this hole, because it doesn't 
have anything to do with the gap.



bq. I think it's simpler to make sure that stopfilter doesn't break a graph 
like Robert suggested.

AFAICT Robert is suggesting a StopFilter *mode* that would *optionally* remove 
gaps.  IOW its current behavior would remain (and be the default).




[jira] [Commented] (SOLR-11968) Multi-words query time synonyms

2018-02-22 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373105#comment-16373105
 ] 

Jim Ferenczi commented on SOLR-11968:
-

The posLen of olimpique is 1 but marseille has a posInc of 2. This means that 
there is a hole between olimpique and marseille but posLen doesn't indicate 
this hole. It should be corrected by the stop filter (e.g. setting the posInc 
of marseille to 1 or setting the poslen of olimpique to 2). We could try to 
detect this when we build the graph (olimpique points to a state that doesn't 
exist), that's what TokenStreamToAutomaton does but I don't think it can catch 
all cases. I think it's simpler to make sure that stopfilter doesn't break a 
graph like Robert suggested.
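
As a rough illustration of the detection idea, the sketch below computes the start and end node of each token from posInc and posLen and reports tokens whose end node is neither the final node nor the start of any other token. The class `DanglingTokenCheck` is hypothetical and much simpler than what TokenStreamToAutomaton actually does.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DanglingTokenCheck {

  /**
   * Returns the terms that point to a node where no token starts.
   * The three arrays are parallel: one entry per token, in stream order.
   */
  static List<String> dangling(String[] terms, int[] posInc, int[] posLen) {
    int n = terms.length;
    int[] to = new int[n];
    Set<Integer> starts = new HashSet<>();
    int pos = -1;
    int last = 0;
    for (int i = 0; i < n; i++) {
      pos += posInc[i];          // start node of token i
      starts.add(pos);
      to[i] = pos + posLen[i];   // end node of token i
      last = Math.max(last, to[i]);
    }

    List<String> out = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      if (to[i] != last && !starts.contains(to[i])) {
        out.add(terms[i]);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // "olimpique de marseille" with synonym "om", after "de" is removed.
    System.out.println(dangling(
        new String[] {"olimpique", "om", "marseille"},
        new int[] {1, 0, 2},
        new int[] {1, 3, 1}));

    // "twd" expanded to "the walking dead", after "the" is removed.
    System.out.println(dangling(
        new String[] {"twd", "walking", "dead"},
        new int[] {1, 1, 1},
        new int[] {3, 1, 1}));
  }
}
```

The first call flags "olimpique", whose end node is not the start of any token. The second call flags nothing even though no token arrives at the node where "walking" starts, which is one example of a case this kind of check does not catch.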




[jira] [Commented] (SOLR-11968) Multi-words query time synonyms

2018-02-22 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372968#comment-16372968
 ] 

Steve Rowe commented on SOLR-11968:
---

Thanks [~jim.ferenczi], I hadn't seen LUCENE-8137, I'll resolve this issue as a 
duplicate.

bq. The problem here is about broken token stream where the posLength of some 
multi-word synonyms are invalidated by the removal of a token.

I don't think the token streams are always broken?  E.g. for "olimpique de 
marseille" with synonym "om" and "de" as stopword (see this issue's 
description):

|| ||olimpique||om||marseille||
|posinc|1|0|2|
|poslen|1|3|1|

In ^^ , the posLength is not invalidated.  What exactly is broken here?

> Multi-words query time synonyms
> ---
>
> Key: SOLR-11968
> URL: https://issues.apache.org/jira/browse/SOLR-11968
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers, Schema and Analysis
>Affects Versions: master (8.0), 6.6.2
> Environment: Centos 7.x
>Reporter: Dominique Béjean
>Priority: Major
>
> I am trying multi-word query-time synonyms with Solr 6.6.2 and the 
> SynonymGraphFilterFactory filter, as explained in this article
>  
> [https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/]
>   
>  My field type is :
> {code:java}
> 
>      
>        
>                      articles="lang/contractions_fr.txt"/>
>        
>        
>         ignoreCase="true"/>
>        
>      
>      
>        
>                      articles="lang/contractions_fr.txt"/>
>        
>                      ignoreCase="true" expand="true"/>
>        
>         ignoreCase="true"/>
>        
>      
>    {code}
>  
>  synonyms.txt contains the line :
> {code:java}
> om, olympique de marseille{code}
>  
>  stopwords.txt contains the word 
> {code:java}
> de{code}
>  
>  The order of words in my query has an impact on the generated query in 
> edismax
> {code:java}
> q={!edismax qf='name_text_gp' v=$qq}
>  =false
>  =...{code}
> with "qq=om maillot" or "qq=olympique de marseille maillot", I can see the 
> synonyms expansion. It is working as expected.
> {code:java}
> "parsedquery_toString":"+(((+name_text_gp:olympiqu +name_text_gp:marseil 
> +name_text_gp:maillot) name_text_gp:om))",
>  "parsedquery_toString":"+((name_text_gp:om (+name_text_gp:olympiqu 
> +name_text_gp:marseil +name_text_gp:maillot)))",{code}
> with "qq=maillot om" or "qq=maillot olympique de marseille", I can see the 
> same generated query 
> {code:java}
> "parsedquery_toString":"+((name_text_gp:maillot) (name_text_gp:om))",
>  "parsedquery_toString":"+((name_text_gp:maillot) (name_text_gp:om))",{code}
> I don't understand these generated queries. The first one looks like the 
> synonym expansion is ignored, but the second one shows it is not ignored and 
> only the synonym term is used.
>   
>  When I test the analysis for the field type, the synonyms are correctly 
> expanded for both expressions
> {code:java}
> om maillot  
>  maillot om
>  olympique de marseille maillot
>  maillot olympique de marseille{code}
> resulting outputs always include the following terms (obviously not always 
> in the same order)
> {code:java}
> olympiqu om marseil maillot {code}
>  
>  So, I suspect an issue with the edismax query parser.






[jira] [Commented] (SOLR-11968) Multi-words query time synonyms

2018-02-21 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372127#comment-16372127
 ] 

Jim Ferenczi commented on SOLR-11968:
-

This issue is also described in 
https://issues.apache.org/jira/browse/LUCENE-8137 . 
https://issues.apache.org/jira/browse/LUCENE-7848 is different: it is about 
adding gaps in the span query produced when a multi-word synonym occurs. The 
problem here is about a broken token stream where the posLength of some 
multi-word synonyms is invalidated by the removal of a token.  The query 
builder in this case will omit some tokens because their posLength is broken. 
I like the idea of adding a new mode to StopFilter that updates 
posLength and posInc when needed, because I don't think we can "fix" a broken 
token stream outside of the token filter that broke it.




[jira] [Commented] (SOLR-11968) Multi-words query time synonyms

2018-02-21 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372030#comment-16372030
 ] 

Steve Rowe commented on SOLR-11968:
---

FYI [~rcmuir] I'm going to copy your comment ^^ over to LUCENE-4065 and comment 
on it there.




[jira] [Commented] (SOLR-11968) Multi-words query time synonyms

2018-02-20 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370916#comment-16370916
 ] 

Robert Muir commented on SOLR-11968:


I think the issue is still valid. It's a little more complex now because of 
positionLength (which means more buffering when you see posLength > 1, because 
you'll need to adjust if you remove something in its path), but the idea is the 
same: give the user a choice between "insert mode" and "replace mode". But this 
new "insert mode" would actually work correctly, correcting posLengths before 
and posIncs after as appropriate, similar to how your editor might have to 
recompute some line breaks/word wrapping and so on.

If you have baseball (length=2), base (length=1), ball (length=1), and you delete 
"base" in this case, you need to change baseball's length to 1 before you emit 
it, because you deleted base. That's the "buffering before" that would be 
required for posLength. And you still need the same buffering described on the 
issue for posInc=0 that might occur after the fact, so you don't wrongly 
transfer synonyms to different words entirely.

It would be slower than the "replace mode" that we have today, but only because 
of the buffering, and I think it's pretty contained, but I haven't fully thought 
it through or tried to write any code.
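
One possible reading of the "insert mode" idea can be sketched in plain Java. This is only an illustration under stated assumptions, not Lucene's API and not a proposed patch: `Token` and `removeAndRepair` are hypothetical names, the whole stream is held in a list instead of being buffered incrementally inside a TokenFilter, every token is assumed to have posLen >= 1, and only removed tokens with posLen = 1 have their position closed up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class InsertModeSketch {
    /** Hypothetical token holder; not a Lucene class. */
    static final class Token {
        final String term;
        final int posInc;
        final int posLen;

        Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }

        @Override
        public String toString() {
            return term + "(posInc=" + posInc + ",posLen=" + posLen + ")";
        }
    }

    /** Drops stopwords and closes the position they occupied, repairing posInc and posLen. */
    static List<Token> removeAndRepair(List<Token> in, Set<String> stopwords) {
        int n = in.size();
        int[] from = new int[n];
        int[] to = new int[n];
        int pos = -1;
        int maxNode = 0;
        for (int i = 0; i < n; i++) {          // turn (posInc, posLen) into from/to nodes
            pos += in.get(i).posInc;
            from[i] = pos;
            to[i] = pos + in.get(i).posLen;
            maxNode = Math.max(maxNode, to[i]);
        }

        // collapse[p] means: the step from node p to node p+1 is removed.
        boolean[] collapse = new boolean[maxNode + 1];
        for (int i = 0; i < n; i++) {
            if (stopwords.contains(in.get(i).term) && in.get(i).posLen == 1) {
                collapse[from[i]] = true;
            }
        }
        // Every kept token must still span at least one step, otherwise keep its last step.
        for (int i = 0; i < n; i++) {
            if (stopwords.contains(in.get(i).term)) continue;
            boolean keepsAStep = false;
            for (int p = from[i]; p < to[i]; p++) {
                if (!collapse[p]) keepsAStep = true;
            }
            if (!keepsAStep) collapse[to[i] - 1] = false;
        }

        // shift[node] = number of collapsed steps before that node.
        int[] shift = new int[maxNode + 1];
        for (int node = 1; node <= maxNode; node++) {
            shift[node] = shift[node - 1] + (collapse[node - 1] ? 1 : 0);
        }

        List<Token> out = new ArrayList<>();
        int prev = -1;
        for (int i = 0; i < n; i++) {
            Token t = in.get(i);
            if (stopwords.contains(t.term)) continue;
            int f = from[i] - shift[from[i]];   // renumbered start node
            int e = to[i] - shift[to[i]];       // renumbered end node
            out.add(new Token(t.term, f - prev, e - f));
            prev = f;
        }
        return out;
    }
}
```

With the input baseball(posInc=1, posLen=2), base(posInc=0, posLen=1), ball(posInc=1, posLen=1) and "base" as the only stopword, the sketch returns baseball(posInc=1, posLen=1) followed by ball(posInc=0, posLen=1), which matches the length correction described above. A streaming filter would have to hold back "baseball" until it has seen whether anything on its path is removed, which is the buffering cost mentioned in the comment.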




[jira] [Commented] (SOLR-11968) Multi-words query time synonyms

2018-02-20 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370899#comment-16370899
 ] 

Steve Rowe commented on SOLR-11968:
---

bq. LUCENE-4065 should probably be closed as won't-fix (I'll comment there in a 
sec).

Maybe not?  Although the {{enablePositionIncrements()}} option was removed from 
StopFilter et al via LUCENE-4963, Robert Muir wrote that the idea in 
LUCENE-4065 may still have merit: 
[https://discuss.elastic.co/t/stop-filter-problem-enablepositionincrements-false-is-not-supported-anymore-as-of-lucene-4-4-as-it-can-create-broken-token-streams/13457/5]




[jira] [Commented] (SOLR-11968) Multi-words query time synonyms

2018-02-20 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370883#comment-16370883
 ] 

Steve Rowe commented on SOLR-11968:
---

bq. I think the root cause is LUCENE-4065. I'll try to make a simple test 
demonstrating this.

Not so - LUCENE-4065 should probably be closed as won't-fix (I'll comment there 
in a sec).

Instead, this looks like the problem described in LUCENE-7848.  I tracked the 
problem down to a bug in Lucene's QueryBuilder, which is dropping tokens in 
side paths with position gaps that are caused by StopFilter.

Below is a test that shows the problem - MockSynonymFilter has synonym "cavy" 
for "guinea pig", and the anonymous analyzer below has "pig" on its 
stopfilter's stoplist.  QueryBuilder produces a query for only "cavy", even 
though the token stream also contains "guinea".

{code:java|title=TestQueryBuilder.java}
  public void testGraphStop() {
    Query syn1 = new TermQuery(new Term("field", "guinea"));
    Query syn2 = new TermQuery(new Term("field", "cavy"));

    BooleanQuery synQuery = new BooleanQuery.Builder()
        .add(syn1, BooleanClause.Occur.SHOULD)
        .add(syn2, BooleanClause.Occur.SHOULD)
        .build();
    BooleanQuery expectedGraphQuery = new BooleanQuery.Builder()
        .add(synQuery, BooleanClause.Occur.SHOULD)
        .build();
    QueryBuilder queryBuilder = new QueryBuilder(new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        MockTokenizer tokenizer = new MockTokenizer();
        TokenStream stream = new MockSynonymFilter(tokenizer);
        stream = new StopFilter(stream, CharArraySet.copy(Collections.singleton("pig")));
        return new TokenStreamComponents(tokenizer, stream);
      }
    });
    queryBuilder.setAutoGenerateMultiTermSynonymsPhraseQuery(true);
    assertEquals(expectedGraphQuery,
        queryBuilder.createBooleanQuery("field", "guinea pig", BooleanClause.Occur.SHOULD));
  }
{code}
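
A simplified model may help show how a token on a side path with a gap can go missing. The sketch below is plain Java and is not the QueryBuilder code: `Token` and `completePaths` are hypothetical names, and the token values assume that the synonym filter emits "cavy" at the same position as "guinea" with posLen = 2. It lists only the term sequences that lead from the first node to the last node of the graph.

```java
import java.util.ArrayList;
import java.util.List;

class GraphPaths {
    /** Hypothetical token holder; not a Lucene class. */
    static final class Token {
        final String term;
        final int posInc;
        final int posLen;

        Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Returns every term sequence that connects node 0 to the last node. */
    static List<String> completePaths(List<Token> tokens) {
        int n = tokens.size();
        int[] from = new int[n];
        int[] to = new int[n];
        int pos = -1;
        int end = 0;
        for (int i = 0; i < n; i++) {
            pos += tokens.get(i).posInc;
            from[i] = pos;
            to[i] = pos + tokens.get(i).posLen;
            end = Math.max(end, to[i]);
        }
        List<String> out = new ArrayList<>();
        walk(tokens, from, to, 0, end, "", out);
        return out;
    }

    private static void walk(List<Token> tokens, int[] from, int[] to,
                             int node, int end, String prefix, List<String> out) {
        if (node == end) {                 // reached the final node: a complete path
            out.add(prefix.trim());
            return;
        }
        for (int i = 0; i < tokens.size(); i++) {
            if (from[i] == node) {         // follow every arc leaving this node
                walk(tokens, from, to, to[i], end, prefix + " " + tokens.get(i).term, out);
            }
        }
    }

    public static void main(String[] args) {
        // "guinea pig" with synonym "cavy", after "pig" has been removed by the stop filter.
        List<Token> afterStop = List.of(new Token("guinea", 1, 1), new Token("cavy", 0, 2));
        System.out.println(completePaths(afterStop));
    }
}
```

In this model the stream after stop-word removal yields only `[cavy]`, because the arc for "guinea" ends on a node that no remaining token leaves. With "pig" still present, the same helper returns both "guinea pig" and "cavy".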


[jira] [Commented] (SOLR-11968) Multi-words query time synonyms

2018-02-15 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366586#comment-16366586
 ] 

Steve Rowe commented on SOLR-11968:
---

I think the root cause is LUCENE-4065.  I'll try to make a simple test 
demonstrating this.




[jira] [Commented] (SOLR-11968) Multi-words query time synonyms

2018-02-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360143#comment-16360143
 ] 

Dominique Béjean commented on SOLR-11968:
-

Following Steve's comments, I ran this test:

1/ put the SynonymGraphFilterFactory after the StopFilterFactory in the 
query-time analysis chain
{code:java}

 
 
 
 
 
 
 
{code}
2/ remove the stop word from the synonyms file

om, olympique marseille


The parsed query strings are:

for "om maillot"
{code:java}
"parsedquery_toString":"+(+name_text_gp:olympiqu +name_text_gp:marseil) 
name_text_gp:om)) (name_text_gp:maillot))~1)",{code}
for "olympique de marseille maillot"
{code:java}
"parsedquery_toString":"+name_text_gp:om (+name_text_gp:olympiqu 
+name_text_gp:marseil))) (name_text_gp:maillot))~1)",{code}
for "maillot om"
{code:java}
"parsedquery_toString":"+(((name_text_gp:maillot) (((+name_text_gp:olympiqu 
+name_text_gp:marseil) name_text_gp:om)))~1)",{code}
for "maillot olympique de marseille" 
{code:java}
"parsedquery_toString":"+(((name_text_gp:maillot) ((name_text_gp:om 
(+name_text_gp:olympiqu +name_text_gp:marseil~1)",{code}

The query result counts are also the same for all queries.




[jira] [Commented] (SOLR-11968) Multi-words query time synonyms

2018-02-11 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360105#comment-16360105
 ] 

Steve Rowe commented on SOLR-11968:
---

I can see the same behavior on master too, not just on the 
releases/lucene-solr/6.6.2 tag.

One interesting thing I found is that if I remove the stop filter from the 
query analyzer, I get the following for qq=“maillot om”:

+((name_text_gp:maillot) (((+name_text_gp:olympiqu +name_text_gp:de 
+name_text_gp:marseil) name_text_gp:om)))

