[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled
[ https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378763#comment-16378763 ]

Rupert Westenthaler commented on LUCENE-8183:
---------------------------------------------

Patch: [^LUCENE-8183_20180227_rwesten.diff]

h3. New Parameters

* {{noSubMatches}}: true/false
* {{noOverlappingMatches}}: true/false

Together with the existing {{onlyLongestMatch}}, these can be used to define which subwords should be added as tokens. Functionality is as described above. Typically users will only want to set one of the three attributes, as {{noOverlappingMatches}} is the most restrictive and {{noSubMatches}} is more restrictive than {{onlyLongestMatch}}. When a more restrictive option is enabled, the state of the less restrictive ones has no effect.

Because of that, it would be an option to refactor this into a single attribute with different settings, but this would require thinking about backward compatibility for configurations that use {{onlyLongestMatch=true}} at the moment.

h3. Algorithm

If processing of subwords is deactivated (any of {{onlyLongestMatch}}, {{noSubMatches}}, {{noOverlappingMatches}} is active), the algorithm first checks whether the token is part of the dictionary. If so, it returns immediately. This avoids adding tokens for subwords when the token itself is in the dictionary (see {{#testNoSubAndTokenInDictionary}} for more info).

I changed the iteration direction of the inner {{for}} loop to start with the longest possible subword, as this simplified the code. _NOTE:_ this also changes the order of the tokens in the token stream, but as all tokens are at the same position that should not make any difference. I did, however, have to modify some existing tests, as those were sensitive to the ordering.

h3. Tests

I added two test methods in {{TestCompoundWordTokenFilter}}:

1. {{#testNoSubAndNoOverlap()}} tests the expected behaviour of the {{noSubMatches}} and {{noOverlappingMatches}} options
2. {{#testNoSubAndTokenInDictionary()}} tests that no tokens for subwords are added when the token is part of the dictionary

In addition, {{TestHyphenationCompoundWordTokenFilterFactory#testLucene8183()}} asserts that the new configuration options are parsed.

h3. Environment

This patch is based on {{master}} from {{g...@github.com:apache/lucene-solr.git}}

> HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled
> -------------------------------------------------------------------------------------------
>
> Key: LUCENE-8183
> URL: https://issues.apache.org/jira/browse/LUCENE-8183
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 6.6
> Environment: Configuration of the analyzer:
> hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
> dictionary="lang/wordlist_de.txt"
> onlyLongestMatch="true"/>
> Reporter: Rupert Westenthaler
> Assignee: Uwe Schindler
> Priority: Major
> Attachments: LUCENE-8183_20180223_rwesten.diff, LUCENE-8183_20180227_rwesten.diff, lucene-8183.zip
>
>
> The HyphenationCompoundWordTokenFilter creates overlapping tokens even if onlyLongestMatch is enabled.
> Example:
> Dictionary: {{gesellschaft}}, {{schaft}}
> Hyphenator: {{de_DR.xml}} //from Apache Offo
> onlyLongestMatch: true
>
> |text|gesellschaft|gesellschaft|schaft|
> |raw_bytes|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[73 63 68 61 66 74]|
> |start|0|0|0|
> |end|12|12|12|
> |positionLength|1|1|1|
> |type|word|word|word|
> |position|1|1|1|
> IMHO this includes 2 unexpected tokens:
> # the 2nd 'gesellschaft', as it duplicates the original token
> # the 'schaft', as it is a sub-token of 'gesellschaft', which is present in the dictionary

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
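The precedence among the three flags and the early exit described in the comment above could be sketched roughly as follows. This is only an illustration and not code from the patch: {{MatchModeSketch}}, {{MatchMode}}, {{resolve}} and {{skipDecomposition}} are hypothetical names, and the single-setting enum corresponds to the possible refactoring mentioned in the comment, not to anything the patch is known to implement. The dictionary is assumed to hold terms in the same case as the token.

```java
import java.util.Set;

/** Hypothetical sketch; none of these names come from the LUCENE-8183 patch. */
final class MatchModeSketch {

  /** Ordered from least to most restrictive, following the comment's description. */
  enum MatchMode { ALL, ONLY_LONGEST_MATCH, NO_SUB_MATCHES, NO_OVERLAPPING_MATCHES }

  /** The most restrictive enabled flag wins; less restrictive flags then have no effect. */
  static MatchMode resolve(boolean onlyLongestMatch, boolean noSubMatches,
      boolean noOverlappingMatches) {
    if (noOverlappingMatches) return MatchMode.NO_OVERLAPPING_MATCHES;
    if (noSubMatches) return MatchMode.NO_SUB_MATCHES;
    if (onlyLongestMatch) return MatchMode.ONLY_LONGEST_MATCH;
    return MatchMode.ALL;
  }

  /**
   * Early exit: when any restriction is active and the token itself is a
   * dictionary entry, no subword tokens are produced for it.
   */
  static boolean skipDecomposition(MatchMode mode, String token, Set<String> dictionary) {
    return mode != MatchMode.ALL && dictionary.contains(token);
  }
}
```

With the dictionary {{gesellschaft}}, {{schaft}} from the issue description, {{skipDecomposition}} returns true for the token {{gesellschaft}} under any of the three restrictive modes, so neither the duplicate nor {{schaft}} would be added.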
[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled
[ https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16376887#comment-16376887 ]

Rupert Westenthaler commented on LUCENE-8183:
---------------------------------------------

That's helpful indeed! Thanks.
[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled
[ https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16376826#comment-16376826 ]

Matthias Krueger commented on LUCENE-8183:
------------------------------------------

You might want to have a look at "mocking" the HyphenationTree (see my patch for LUCENE-8185), which simplifies writing a decompounding test.
[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled
[ https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16376748#comment-16376748 ]

Rupert Westenthaler commented on LUCENE-8183:
---------------------------------------------

FYI: I plan to spend some time implementing a version of the DictionaryCompoundWordTokenFilter that adds options for

* `noSub`: no tokens are added that are completely enclosed by a longer one (`fußballpumpe`: `fußball`, `ballpumpe`)
* `noOverlap`: no overlapping tokens (`fußballpumpe`: `fußball`, `pumpe`)

IMO the simplest way is to first emit all tokens and later filter those based on the active options (`onlyLongestMatch`, `noSub`, `noOverlap`).

Regarding the test: providing good test examples is hard, as the current test cases are based on Danish and I do not speak this language. Providing examples in German would be easy, but this would require a German hyphenator, and the file is licensed under the LaTeX Project Public License and can therefore not be included in the source.

Given suitable examples, implementing the actual test seems to be rather easy, as it can be written similarly to the existing test cases.
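The "emit all tokens first, then filter" idea from the comment above might look roughly like the sketch below. It is an illustration under stated assumptions, not the planned implementation: {{SubwordFilterSketch}}, {{Match}}, {{noSub}} and {{noOverlap}} are hypothetical names, and the left-to-right scan that prefers the longest match at each start offset is an assumed tie-break chosen because it reproduces the `fußballpumpe` examples given in the comment.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of "emit all candidates, then filter"; not code from any patch. */
final class SubwordFilterSketch {

  /** A dictionary match inside the token: [start, end) offsets plus the matched text. */
  static final class Match {
    final int start;
    final int end;
    final String text;

    Match(int start, int end, String text) {
      this.start = start;
      this.end = end;
      this.text = text;
    }

    int length() {
      return end - start;
    }
  }

  /** noSub: drop every match that lies completely inside a longer match. */
  static List<String> noSub(List<Match> all) {
    List<String> kept = new ArrayList<>();
    for (Match m : all) {
      boolean enclosed = false;
      for (Match other : all) {
        if (other.length() > m.length() && other.start <= m.start && m.end <= other.end) {
          enclosed = true;
          break;
        }
      }
      if (!enclosed) {
        kept.add(m.text);
      }
    }
    return kept;
  }

  /** noOverlap: scan left to right, taking the longest match at each start (assumed tie-break). */
  static List<String> noOverlap(List<Match> all) {
    List<Match> sorted = new ArrayList<>(all);
    // Order by start offset ascending, then by length descending.
    sorted.sort((a, b) -> a.start != b.start
        ? Integer.compare(a.start, b.start)
        : Integer.compare(b.length(), a.length()));
    List<String> kept = new ArrayList<>();
    int coveredUntil = 0;
    for (Match m : sorted) {
      if (m.start >= coveredUntil) { // does not overlap anything already kept
        kept.add(m.text);
        coveredUntil = m.end;
      }
    }
    return kept;
  }
}
```

For `fußballpumpe` with the candidate matches `fuß` (0-3), `fußball` (0-7), `ball` (3-7), `ballpumpe` (3-12) and `pumpe` (7-12), {{noSub}} keeps `fußball` and `ballpumpe`, while {{noOverlap}} keeps `fußball` and `pumpe`.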
[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled
[ https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375154#comment-16375154 ]

Uwe Schindler commented on LUCENE-8183:
---------------------------------------

I have to check the algorithm, but to get this patch into Lucene, the test cases need to be adapted to check this new behaviour.
[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled
[ https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375147#comment-16375147 ]

Uwe Schindler commented on LUCENE-8183:
---------------------------------------

bq. I am aware of this possibility. In fact I do use the RemoveDuplicatesTokenFilter to remove those tokens. My point was just why they are added in the first place.

I think it's good to not add them in the first place. The change is quite simple, so it can be done here. And it does not really complicate the algorithm, as it's done in one separate place.
[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled
[ https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375144#comment-16375144 ]

Uwe Schindler commented on LUCENE-8183:
---------------------------------------

[~rwesten]: I was not aware that this was my dictionary file! The names in your example did not look like the example listed here: https://github.com/uschindler/german-decompounder
[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled
[ https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374916#comment-16374916 ]

Rupert Westenthaler commented on LUCENE-8183:
---------------------------------------------

I was not aware that this is the intended behaviour.

{quote}For "Fußballpumpe" and dictionary "Ball", "Ballpumpe", "Pumpe", "Fuß", "Fußball" you would get the tokens "Fußball" and "pumpe" but not "Ballpumpe" as "Ball" has already been considered part of Fußball. Also, not sure if your change also improves the situation for languages other than German.{quote}

That's a good point. Maybe one should still consider parts that are not enclosed by a token that was already decomposed. So for {{Fußballpumpe}}: {{ball}} would be ignored as {{Fußball}} is already present, but {{ballpumpe}} would still be added as a token. Finally, {{pumpe}} is ignored as {{ballpumpe}} is present.

This reminds me of {{ALL}}, {{NO_SUB}} and {{LONGEST_DOMINANT_RIGHT}} as supported by the [Solr Text Tagger|https://github.com/OpenSextant/SolrTextTagger#the-tagger-request-time-parameters-are]

{quote}Perhaps these kind of adjustments should rather be done in a TokenFilter similar to RemoveDuplicatesTokenFilter instead of complicating the decompounding algorithm?{quote}

I am aware of this possibility. In fact, I do use the {{RemoveDuplicatesTokenFilter}} to remove those tokens. My point was just to ask why they are added in the first place.
[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled
[ https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374888#comment-16374888 ]

Rupert Westenthaler commented on LUCENE-8183:
---------------------------------------------

AFAIK I use exactly this dictionary and hyphenator config. I will provide a Solr core config that can be used to reproduce the described issue.
[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled
[ https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374886#comment-16374886 ]

Matthias Krueger commented on LUCENE-8183:
------------------------------------------

[~rwesten] Quick question regarding your patch: What's the reasoning behind not decomposing terms that are part of the dictionary at all?

The {{onlyLongestMatch}} flag currently affects whether all matches or only the longest match should be returned *per* *start* character (in DictionaryCompoundWordTokenFilter) or *per* hyphenation *start* point (in HyphenationCompoundWordTokenFilter).

Example: Dictionary {{"Schaft", "Wirt", "Wirtschaft", "Wissen", "Wissenschaft"}} for input "Wirtschaftswissenschaft" will return the original input plus tokens "Wirtschaft", "schaft", "wissenschaft", "schaft" but not "Wirt" or "Wissen". "schaft" is still returned (even twice) because it's the longest token starting at the respective position.

I like the idea of restricting this further to only the longest terms that *touch* a certain hyphenation point. This would exclude "schaft" in the example above (as "Wirtschaft" and "wissenschaft" are two longer terms encompassing the respective hyphenation point). On the other hand, there might be examples where you still want to include the "overlapping" tokens. For "Fußballpumpe" and dictionary {{"Ball", "Ballpumpe", "Pumpe", "Fuß", "Fußball"}} you would get the tokens "Fußball" and "pumpe" but not "Ballpumpe" as "Ball" has already been considered part of Fußball. Also, not sure if your change also improves the situation for languages other than German.

Regarding point 1: The current algorithm always returns the term itself again if it's part of the dictionary. I guess this could be changed if we don't check against {{this.maxSubwordSize}} but against {{Math.min(this.maxSubwordSize, termAtt.length()-1)}}.

Perhaps these kind of adjustments should rather be done in a TokenFilter similar to RemoveDuplicatesTokenFilter instead of complicating the decompounding algorithm?
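The per-start-character behaviour of {{onlyLongestMatch}} described above, and the suggested length cap, can be illustrated with a small stand-alone sketch. This is not the Lucene implementation: {{LongestPerStartSketch}}, {{decompose}} and the {{excludeWholeToken}} switch are hypothetical names, the dictionary is assumed to be lowercase, lowercasing is assumed not to change the token length, and unlike the real filter the sketch returns only the subword tokens, not the original input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of per-start-position matching; not the Lucene implementation. */
final class LongestPerStartSketch {

  /**
   * Returns the dictionary matches found in {@code token}, scanning every start offset.
   * With {@code onlyLongestMatch}, only the longest match per start offset is kept.
   * With {@code excludeWholeToken}, the length limit becomes
   * min(maxSubwordSize, token.length() - 1), so the token itself is never returned.
   */
  static List<String> decompose(String token, Set<String> dictionary, int minSubwordSize,
      int maxSubwordSize, boolean onlyLongestMatch, boolean excludeWholeToken) {
    String lower = token.toLowerCase(Locale.ROOT);
    int cap = excludeWholeToken ? Math.min(maxSubwordSize, token.length() - 1) : maxSubwordSize;
    List<String> result = new ArrayList<>();
    for (int start = 0; start + minSubwordSize <= token.length(); start++) {
      String longest = null;
      for (int len = minSubwordSize; len <= cap && start + len <= token.length(); len++) {
        if (dictionary.contains(lower.substring(start, start + len))) {
          String match = token.substring(start, start + len);
          if (onlyLongestMatch) {
            longest = match; // lengths only grow, so a later hit is always longer
          } else {
            result.add(match);
          }
        }
      }
      if (longest != null) {
        result.add(longest);
      }
    }
    return result;
  }
}
```

With the dictionary from the example and {{onlyLongestMatch}} enabled, "Wirtschaftswissenschaft" yields "Wirtschaft", "schaft", "wissenschaft", "schaft". For the token "gesellschaft" with dictionary {{gesellschaft}}, {{schaft}}, enabling the length cap drops the duplicate "gesellschaft" and leaves only "schaft".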
[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled
[ https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374753#comment-16374753 ]

Uwe Schindler commented on LUCENE-8183:
---------------------------------------

Hi, I have not noticed this with my dictionaries, I have to dig into that. Did you also check the German dictionary which is provided here: https://github.com/uschindler/german-decompounder
[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled
[ https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374237#comment-16374237 ]

Rupert Westenthaler commented on LUCENE-8183:
---------------------------------------------

Added a patch that shows how the unexpected behaviour could be fixed.