Re[2]: SynonymGraphFilterFactory with WordDelimiterGraphFilterFactory usage

Александр Шестак Mon, 05 Feb 2018 22:31:42 -0800

Hi, thank you for your explanation.
I have one more question related to this topic.
I changed my schema as follows (replacing SynonymGraphFilterFactory with 
SynonymFilterFactory):
<fieldtype name="fulltext_en" class="solr.TextField"
           autoGeneratePhraseQueries="true">
   <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterGraphFilterFactory"
              generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"
              catenateWords="1" catenateNumbers="1" catenateAll="0"
              preserveOriginal="1" protected="protwords_en.txt"/>
      <filter class="solr.FlattenGraphFilterFactory"/>
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterGraphFilterFactory"
              generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"
              catenateWords="0" catenateNumbers="0" catenateAll="0"
              preserveOriginal="1" protected="protwords_en.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SynonymFilterFactory"
              synonyms="synonyms_en.txt" ignoreCase="true" expand="true"/>
   </analyzer>
</fieldtype>
Now I have another strange issue.
If I configure the synonyms like this:
b=>b,boron
2=>ii,2
then for the query "my_field:b2" the parsed query is "my_field:b2 
Synonym(my_field:2 my_field:ii)".
But when I change the synonyms to
b,boron
ii,2
then for the query "my_field:b2" the parsed query is "my_field:b2 my_field:\"b 2\" 
my_field:\"b ii\" my_field:\"boron 2\" my_field:\"boron ii\")"
The second query is correct (it applies synonyms to both parts produced by the 
word split). Can somebody explain why the synonym behavior depends on the kind 
of synonym mapping?
And, in general, is it correct to use SynonymFilterFactory after 
WordDelimiterGraphFilterFactory? We can't use two graph filters together, so 
does that force me to use the deprecated SynonymFilterFactory?
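
To make the expectation concrete, here is a minimal Python sketch of how I read 
the two rule styles. It only models the rule semantics (an explicit "=>" mapping 
versus an equivalence list with expand="true") and is not how Solr implements 
synonym filtering or token graphs. The helper names parse_rules and expand_parts 
are hypothetical, made up for this illustration.

```python
from itertools import product


def parse_rules(lines, expand=True):
    """Parse Solr-style synonym rules into {term: [replacement, ...]}."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "=>" in line:
            # Explicit mapping: left-hand terms are replaced by the right-hand side.
            lhs, rhs = line.split("=>", 1)
            sources = [t.strip() for t in lhs.split(",")]
            targets = [t.strip() for t in rhs.split(",")]
        else:
            # Equivalence list: with expand=True every term maps to all terms,
            # otherwise every term maps to the first term only.
            terms = [t.strip() for t in line.split(",")]
            sources = terms
            targets = terms if expand else terms[:1]
        for source in sources:
            bucket = mapping.setdefault(source, [])
            for target in targets:
                if target not in bucket:
                    bucket.append(target)
    return mapping


def expand_parts(parts, mapping):
    """Return every phrase obtained by swapping each part for its synonyms."""
    options = [mapping.get(part, [part]) for part in parts]
    return sorted(" ".join(combo) for combo in product(*options))


explicit = parse_rules(["b=>b,boron", "2=>ii,2"])
equivalent = parse_rules(["b,boron", "ii,2"])

# "b2" is split into the parts "b" and "2" by the word delimiter filter.
parts = ["b", "2"]
print(expand_parts(parts, explicit))
print(expand_parts(parts, equivalent))
```

Under this reading both rule sets give the same four phrases for the split 
parts ("b 2", "b ii", "boron 2", "boron ii"), which is why I expected the same 
parsed query in both cases.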



>Monday, 5 February 2018, 19:27 +03:00 from Steve Rowe < sar...@gmail.com >:
>
>Hi Александр,
>
>> On Feb 5, 2018, at 11:19 AM, Shawn Heisey <  apa...@elyograg.org > wrote:
>> 
>> There should be no problem with using them together.
>
>I believe Shawn is wrong.
>
>From <http://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymGraphFilter.html>:
>
>> NOTE: this cannot consume an incoming graph; results will be undefined.
>
>Unfortunately, the ref guide entry for Synonym Graph Filter
><https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#synonym-graph-filter>
>doesn’t include a warning about this, but it should, like the warning on 
>Word Delimiter Graph Filter
><https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#word-delimiter-graph-filter>:
>
>> Note: although this filter produces correct token graphs, it cannot consume 
>> an input token graph correctly.
>
>(I’ve just committed a change to the ref guide source to add this also on the 
>Synonym Graph Filter and Managed Synonym Graph Filter entries, to be included 
>in the ref guide for Solr 7.3.)
>
>In short, the combination of the two filters is not supported, because WDGF 
>produces a token graph, which SGF cannot correctly interpret.
>
>Other filters also have this issue, see e.g. 
><https://issues.apache.org/jira/browse/LUCENE-3475> for ShingleFilter; this 
>issue has gotten some attention recently, and hopefully it will inspire fixes 
>elsewhere.
>
>Patches welcome!
>
>--
>Steve
> www.lucidworks.com
>
>
>> On Feb 5, 2018, at 11:19 AM, Shawn Heisey <  apa...@elyograg.org > wrote:
>> 
>> On 2/5/2018 3:55 AM, Александр Шестак wrote:
>>> 
>>> Hi, I am confused about the usage of SynonymGraphFilterFactory
>>> and WordDelimiterGraphFilterFactory. Can they be used together?
>>> 
>> 
>> There should be no problem with using them together.  But it is always
>> possible that the behavior will surprise you, while working 100% as
>> designed.
>> 
>>> I have a Solr field type configured as follows:
>>> 
>>> <fieldtype name="fulltext_en" class="solr.TextField"
>>>            autoGeneratePhraseQueries="true">
>>>   <analyzer type="index">
>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>     <filter class="solr.WordDelimiterGraphFilterFactory"
>>>             generateWordParts="1" generateNumberParts="1"
>>>             splitOnNumerics="1"
>>>             catenateWords="1" catenateNumbers="1" catenateAll="0"
>>>             preserveOriginal="1" protected="protwords_en.txt"/>
>>>     <filter class="solr.FlattenGraphFilterFactory"/>
>>>   </analyzer>
>>>   <analyzer type="query">
>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>     <filter class="solr.WordDelimiterGraphFilterFactory"
>>>             generateWordParts="1" generateNumberParts="1"
>>>             splitOnNumerics="1"
>>>             catenateWords="0" catenateNumbers="0" catenateAll="0"
>>>             preserveOriginal="1" protected="protwords_en.txt"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.SynonymGraphFilterFactory"
>>>             synonyms="synonyms_en.txt" ignoreCase="true" expand="true"/>
>>>   </analyzer>
>>> </fieldtype>
>>> 
>>> So at query time it uses SynonymGraphFilterFactory after
>>> WordDelimiterGraphFilterFactory.
>>> The synonyms are configured as follows:
>>> b=>b,boron
>>> 2=>ii,2
>>> 
>>> This is how the query looks in the Solr analysis tool. It shows that the
>>> terms after SGF have positions 3 and 4. Is that correct? I thought they
>>> should have positions 1 and 2.
>>> 
>> 
>> What matters is the *relative* positions.  The exact position number
>> doesn't matter much.  Something new that the Graph implementations use
>> is the position length.  That feature is necessary for multi-term
>> synonyms to function correctly in phrase queries.
>> 
>> In your analysis screenshot, WDGF creates three tokens.  The two tokens
>> created by splitting the input are at positions 1 and 2, which I think
>> is 100% as expected.  It also sets the positionLength of the first term
>> to 2, probably because it has split that term into 2 additional terms.
>> 
>> Then the SGF takes those last two terms and expands them.  Each of the
>> synonyms is at the same position as the original term, and the relative
>> positions of the two synonym pairs have not changed -- the second one is
>> still one higher than the first.  I think the reason that SGF moves the
>> positions two higher is because the positionLength on the "b2" term is
>> 2, previously set by WDGF.  Someone with more knowledge about the Graph
>> implementations may have to speak up as to whether this behavior is correct.
>> 
>> Because the relative positions of the split terms don't change when SGF
>> runs, I think this is probably working as designed.
>> 
>> Thanks,
>> Shawn
>

