[jira] [Comment Edited] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

2018-02-27 Thread Rupert Westenthaler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378763#comment-16378763
 ] 

Rupert Westenthaler edited comment on LUCENE-8183 at 2/27/18 3:27 PM:
--

 Patch: [^LUCENE-8183_20180227_rwesten.diff] 

h3. New Parameters:

* {{noSubMatches}}: true/false
* {{noOverlappingMatches}}: true/false

Together with the existing {{onlyLongestMatch}}, these can be used to define 
which subwords are added as tokens. Functionality is as described above.

Typically users will want to enable only one of the three attributes: 
{{noOverlappingMatches}} is the most restrictive, and {{noSubMatches}} 
is more restrictive than {{onlyLongestMatch}}. When a more restrictive 
option is enabled, the state of the less restrictive ones has no effect.
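
To make the relationship between the three options concrete, here is a minimal Java sketch of how they could narrow a set of candidate dictionary matches. It is not the code from the patch: the class and method names are made up for illustration, spans are plain {{start, end}} arrays, candidates are assumed to be distinct, and the semantics used (longest match per start offset, dropping enclosed matches, then greedy left-to-right removal of overlaps) are an assumption about what each option does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not the patch code.
final class SubwordFilterSketch {

    /** True if span a = {start, end} lies inside span b and is not identical to it. */
    static boolean enclosedBy(int[] a, int[] b) {
        return b[0] <= a[0] && a[1] <= b[1] && (a[0] != b[0] || a[1] != b[1]);
    }

    /** Returns the candidate spans that survive the enabled options. */
    static List<int[]> select(List<int[]> candidates, boolean onlyLongestMatch,
                              boolean noSubMatches, boolean noOverlappingMatches) {
        List<int[]> kept = new ArrayList<>();
        for (int[] c : candidates) {
            boolean drop = false;
            for (int[] other : candidates) {
                if (noSubMatches || noOverlappingMatches) {
                    // Assumed semantics: drop anything enclosed by another match.
                    drop |= enclosedBy(c, other);
                } else if (onlyLongestMatch) {
                    // Assumed semantics: keep only the longest match per start offset.
                    drop |= other[0] == c[0] && other[1] > c[1];
                }
            }
            if (!drop) {
                kept.add(c);
            }
        }
        if (!noOverlappingMatches) {
            return kept;
        }
        // Assumed semantics: walk left to right and skip spans that start
        // before the previously accepted span has ended.
        kept.sort((x, y) -> Integer.compare(x[0], y[0]));
        List<int[]> result = new ArrayList<>();
        int lastEnd = 0;
        for (int[] c : kept) {
            if (c[0] >= lastEnd) {
                result.add(c);
                lastEnd = c[1];
            }
        }
        return result;
    }
}
```

For the example from this issue (candidates {{0,12}} for 'gesellschaft' and {{6,12}} for 'schaft'), {{onlyLongestMatch}} alone keeps both spans in this sketch because they start at different offsets, while {{noSubMatches}} keeps only {{0,12}}.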

Because of that, it would be an option to refactor this into a single attribute 
with different settings, but that would require thinking about backward 
compatibility for configurations that currently use {{onlyLongestMatch=true}}.
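
One way such a refactoring could stay backward compatible is sketched below. This is only an illustration of the idea: the enum, its constants, and the {{subwordMode}} attribute name are hypothetical and not part of the patch. The three boolean attribute names are the ones discussed here, and the fallback order follows the rule that the most restrictive enabled option wins.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical single-setting replacement for the three boolean attributes.
enum SubwordMode {
    ALL, ONLY_LONGEST_MATCH, NO_SUB_MATCHES, NO_OVERLAPPING_MATCHES;

    /** Resolves the mode from factory-style string arguments. */
    static SubwordMode fromArgs(Map<String, String> args) {
        // "subwordMode" is an assumed attribute name, used only in this sketch.
        String explicit = args.get("subwordMode");
        if (explicit != null) {
            return valueOf(explicit.trim().toUpperCase(Locale.ROOT));
        }
        // Legacy booleans: the most restrictive enabled flag decides.
        // Boolean.parseBoolean(null) is false, so missing flags are ignored.
        if (Boolean.parseBoolean(args.get("noOverlappingMatches"))) {
            return NO_OVERLAPPING_MATCHES;
        }
        if (Boolean.parseBoolean(args.get("noSubMatches"))) {
            return NO_SUB_MATCHES;
        }
        if (Boolean.parseBoolean(args.get("onlyLongestMatch"))) {
            return ONLY_LONGEST_MATCH;
        }
        return ALL;
    }
}
```

With this shape, an existing configuration that only sets {{onlyLongestMatch=true}} would still resolve to the same behaviour as before.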

h3. Algorithm

If processing of subwords is restricted (any of {{onlyLongestMatch}}, 
{{noSubMatches}}, {{noOverlappingMatches}} is active), the algorithm first 
checks whether the token itself is in the dictionary. If so, it returns immediately. 
This avoids adding tokens for subwords when the token itself is in the 
dictionary (see {{#testNoSubAndTokenInDictionary}} for more info).

I changed the iteration direction of the inner {{for}} loop to start with the 
longest possible subword as this simplified the code. 

_NOTE:_ this also changes the order of the tokens in the token stream, but 
as all tokens are at the same position that should not make any difference. I 
did, however, have to modify some existing tests, as those were sensitive to the 
ordering.
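
The two changes described in this section (the early return and the reversed inner loop) can be sketched roughly as follows. This is a simplified stand-in, not the patched {{decompose()}} method: the class name is invented, {{splitPoints}} is a hypothetical sorted array standing in for the hyphenation points, subword size limits are ignored, and the single {{restrictSubwords}} flag only models the {{onlyLongestMatch}} level of restriction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified illustration, not the patch code.
final class DecomposeSketch {

    /** Collects dictionary subwords of token between the given split offsets. */
    static List<String> decompose(String token, int[] splitPoints,
                                  Set<String> dictionary, boolean restrictSubwords) {
        List<String> subwords = new ArrayList<>();
        // Early return: the token itself is a dictionary word, so no subwords are added.
        if (restrictSubwords && dictionary.contains(token)) {
            return subwords;
        }
        for (int i = 0; i < splitPoints.length - 1; i++) {
            int start = splitPoints[i];
            // Inner loop runs from the longest candidate down to the shortest.
            for (int j = splitPoints.length - 1; j > i; j--) {
                String part = token.substring(start, splitPoints[j]);
                if (dictionary.contains(part)) {
                    subwords.add(part);
                    if (restrictSubwords) {
                        break; // the first hit is the longest match for this start
                    }
                }
            }
        }
        return subwords;
    }
}
```

Because the longest candidate is tried first, the restricted case can stop at the first dictionary hit for each start offset. It also means that, in the unrestricted case, longer subwords are emitted before shorter ones that share the same start, which is the ordering change noted above.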

h3. Tests

I added two test methods in {{TestCompoundWordTokenFilter}}:

1. {{#testNoSubAndNoOverlap()}} tests the expected behaviour of the 
{{noSubMatches}} and {{noOverlappingMatches}} options
2. {{#testNoSubAndTokenInDictionary()}} tests that no tokens for subwords are 
added when the token is part of the dictionary

In addition, {{TestHyphenationCompoundWordTokenFilterFactory#testLucene8183()}} 
asserts that the new configuration options are parsed.

h3. Environment

This patch is based on {{master}} from 
{{g...@github.com:apache/lucene-solr.git}} commit: 
{{d512cd7604741a2f55deb0c840188f0342f1ba57}}




> HyphenationCompoundWordTokenFilter creates overlapping tokens with 
> onlyLongestMatch enabled
> ---
>
> Key: LUCENE-8183
> URL: 

[jira] [Comment Edited] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

2018-02-23 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375144#comment-16375144
 ] 

Uwe Schindler edited comment on LUCENE-8183 at 2/24/18 12:04 AM:
-

[~rwesten]: I was not aware that this was my dictionary file! The names in your 
example (under "Environment" in your report) did not look like the example 
listed here: https://github.com/uschindler/german-decompounder



> HyphenationCompoundWordTokenFilter creates overlapping tokens with 
> onlyLongestMatch enabled
> ---
>
> Key: LUCENE-8183
> URL: https://issues.apache.org/jira/browse/LUCENE-8183
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 6.6
> Environment: Configuration of the analyzer:
> 
> 
>          hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
>          dictionary="lang/wordlist_de.txt" 
>         onlyLongestMatch="true"/>
>  
>Reporter: Rupert Westenthaler
>Assignee: Uwe Schindler
>Priority: Major
> Attachments: LUCENE-8183_20180223_rwesten.diff, lucene-8183.zip
>
>
> The HyphenationCompoundWordTokenFilter creates overlapping tokens even if 
> onlyLongestMatch is enabled. 
> Example:
> Dictionary: {{gesellschaft}}, {{schaft}}
>  Hyphenator: {{de_DR.xml}} //from Apache OFFO
>  onlyLongestMatch: true
>  
> |text|gesellschaft|gesellschaft|schaft|
> |raw_bytes|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[73 63 68 61 66 74]|
> |start|0|0|0|
> |end|12|12|12|
> |positionLength|1|1|1|
> |type|word|word|word|
> |position|1|1|1|
> IMHO this includes 2 unexpected tokens:
>  # the 2nd 'gesellschaft', as it duplicates the original token
>  # the 'schaft', as it is a sub-token of 'gesellschaft', which is present in the 
> dictionary
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
