[
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378763#comment-16378763
]
Rupert Westenthaler edited comment on LUCENE-8183 at 2/27/18 3:27 PM:
--
Patch: [^LUCENE-8183_20180227_rwesten.diff]
h3. New Parameters:
* {{noSubMatches}}: true/false
* {{noOverlappingMatches}}: true/false
Together with the existing {{onlyLongestMatch}}, these can be used to define
which subwords are added as tokens. Functionality is as described above.
Typically users will only want to set one of the three attributes:
{{noOverlappingMatches}} is the most restrictive, and {{noSubMatches}}
is more restrictive than {{onlyLongestMatch}}. When a more restrictive
option is enabled, the state of the less restrictive ones has no effect.
Because of that, one option would be to refactor this into a single attribute
with different settings, but that would require thinking about backward
compatibility for configurations that currently use {{onlyLongestMatch=true}}.
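To illustrate the precedence between the three flags, here is a minimal sketch of how they could collapse into a single setting. The {{MatchMode}} enum and the {{resolve}} helper are hypothetical names used only for this illustration; they are not part of the patch or of Lucene.

```java
public class MatchModeSketch {

    /** Hypothetical single setting, ordered from least to most restrictive. */
    public enum MatchMode {
        ALL, ONLY_LONGEST_MATCH, NO_SUB_MATCHES, NO_OVERLAPPING_MATCHES
    }

    /**
     * Maps the three boolean attributes onto one mode. The most restrictive
     * enabled flag wins, so the state of the less restrictive flags is ignored.
     */
    public static MatchMode resolve(boolean onlyLongestMatch,
                                    boolean noSubMatches,
                                    boolean noOverlappingMatches) {
        if (noOverlappingMatches) {
            return MatchMode.NO_OVERLAPPING_MATCHES;
        }
        if (noSubMatches) {
            return MatchMode.NO_SUB_MATCHES;
        }
        if (onlyLongestMatch) {
            return MatchMode.ONLY_LONGEST_MATCH;
        }
        return MatchMode.ALL;
    }
}
```

With such a mapping, an existing configuration that only sets {{onlyLongestMatch=true}} would still resolve to the same behaviour as before, which is the backward-compatibility concern raised above.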
h3. Algorithm
If processing of subwords is deactivated (any of {{onlyLongestMatch}},
{{noSubMatches}}, {{noOverlappingMatches}} is active), the algorithm first
checks whether the token is part of the dictionary. If so, it returns immediately.
This avoids adding tokens for subwords if the token itself is in the
dictionary (see {{#testNoSubAndTokenInDictionary}} for more info).
I changed the iteration direction of the inner {{for}} loop to start with the
longest possible subword, as this simplified the code.
_NOTE:_ this also changes the order of the tokens in the token stream, but
as all tokens are at the same position that should not make any difference. I
did, however, have to modify some existing tests, as those were sensitive to the
ordering.
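The two points above (the early return and the longest-first inner loop) can be sketched roughly as follows. This is a simplified, self-contained illustration and not the code from the patch: it assumes a plain {{Set<String>}} dictionary and plain character offsets instead of hyphenation points, and the class name, method name and {{restrictSubwords}} parameter are hypothetical. The {{break}} only approximates {{onlyLongestMatch}}; the stricter {{noSubMatches}} and {{noOverlappingMatches}} filtering is not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class LongestFirstDecompounder {

    /**
     * Returns the dictionary subwords found in {@code token}.
     * Assumption: candidates are plain character ranges, not hyphenation points.
     */
    public static List<String> decompose(String token, Set<String> dictionary,
                                         int minSubwordSize, boolean restrictSubwords) {
        List<String> subwords = new ArrayList<>();

        // Early return: if a restrictive option is active and the token itself
        // is a dictionary entry, no subword tokens are added at all.
        if (restrictSubwords && dictionary.contains(token)) {
            return subwords;
        }

        for (int start = 0; start + minSubwordSize <= token.length(); start++) {
            // Inner loop runs from the longest candidate down to the shortest.
            for (int end = token.length(); end - start >= minSubwordSize; end--) {
                String candidate = token.substring(start, end);
                if (dictionary.contains(candidate)) {
                    subwords.add(candidate);
                    if (restrictSubwords) {
                        // The first hit is the longest match for this start
                        // offset, so shorter candidates can be skipped.
                        break;
                    }
                }
            }
        }
        return subwords;
    }
}
```

Because the longest candidate is tried first, keeping only the longest match per start offset needs nothing more than a {{break}}. It also shows why the emitted order changes: with restrictions off, longer subwords now come before shorter ones that start at the same offset.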
h3. Tests
I added two test methods in {{TestCompoundWordTokenFilter}}:
1. {{#testNoSubAndNoOverlap()}} tests the expected behaviour of the
{{noSubMatches}} and {{noOverlappingMatches}} options.
2. {{#testNoSubAndTokenInDictionary()}} tests that no tokens for subwords are
added in the case that the token is part of the dictionary.
In addition, {{TestHyphenationCompoundWordTokenFilterFactory#testLucene8183()}}
asserts that the new configuration options are parsed.
h3. Environment
This patch is based on {{master}} from
{{g...@github.com:apache/lucene-solr.git}} commit:
{{d512cd7604741a2f55deb0c840188f0342f1ba57}}
> HyphenationCompoundWordTokenFilter creates overlapping tokens with
> onlyLongestMatch enabled
> ---
>
> Key: LUCENE-8183
> URL: