[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

2018-02-27 Thread Rupert Westenthaler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378763#comment-16378763
 ] 

Rupert Westenthaler commented on LUCENE-8183:
-

 Patch: [^LUCENE-8183_20180227_rwesten.diff] 

h3. New Parameters:

* {{noSubMatches}}: true/false
* {{noOverlappingMatches}}: true/false

Together with the existing {{onlyLongestMatch}}, these can be used to define 
which subwords are added as tokens. The functionality is as described above.

Typically users will only want to enable one of the three attributes, as 
{{noOverlappingMatches}} is the most restrictive and {{noSubMatches}} 
is more restrictive than {{onlyLongestMatch}}. When a more restrictive 
option is enabled, the state of the less restrictive ones has no effect.

Because of that, it would be an option to refactor this into a single attribute 
with different settings, but this would require thinking about backward 
compatibility for configurations that currently use 
{{onlyLongestMatch=true}}.
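
A minimal sketch of how such a single setting could be derived from the three boolean attributes while keeping existing {{onlyLongestMatch=true}} configurations working. The enum {{SubwordMode}} and its methods are hypothetical names used only for illustration; they are not part of the patch or of Lucene.

```java
import java.util.Map;

/** Hypothetical single setting standing in for the three boolean attributes. */
enum SubwordMode {
  ALL, ONLY_LONGEST_MATCH, NO_SUB_MATCHES, NO_OVERLAPPING_MATCHES;

  /** The most restrictive enabled flag wins; less restrictive flags have no effect. */
  static SubwordMode fromFlags(boolean onlyLongestMatch,
                               boolean noSubMatches,
                               boolean noOverlappingMatches) {
    if (noOverlappingMatches) return NO_OVERLAPPING_MATCHES;
    if (noSubMatches) return NO_SUB_MATCHES;
    if (onlyLongestMatch) return ONLY_LONGEST_MATCH;
    return ALL;
  }

  /** Reads the flags from factory-style string arguments; missing keys count as false. */
  static SubwordMode fromArgs(Map<String, String> args) {
    return fromFlags(
        Boolean.parseBoolean(args.get("onlyLongestMatch")),
        Boolean.parseBoolean(args.get("noSubMatches")),
        Boolean.parseBoolean(args.get("noOverlappingMatches")));
  }
}
```

With this mapping an existing configuration that only sets {{onlyLongestMatch=true}} resolves to {{ONLY_LONGEST_MATCH}}, so its behaviour would not change.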

h3. Algorithm

If processing of subwords is restricted (any of {{onlyLongestMatch}}, 
{{noSubMatches}}, {{noOverlappingMatches}} is active), the algorithm first 
checks if the token is part of the dictionary. If so, it returns immediately. 
This avoids adding tokens for subwords if the token itself is in the 
dictionary (see {{#testNoSubAndTokenInDictionary}} for more info).

I changed the iteration direction of the inner {{for}} loop to start with the 
longest possible subword, as this simplified the code. 

_NOTE:_ this also changes the order of the tokens in the token stream, but 
as all tokens are at the same position that should not make any difference. I 
did, however, have to modify some existing tests as those were sensitive to the 
ordering.
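
The two steps described above (early return, then longest-first iteration) can be sketched in plain Java without Lucene. This is a simplified illustration, not the code from the patch: the class and method names are made up, the hyphenation points are passed in as an assumed {{int[]}} of offsets, and a single {{restrict}} flag stands in for "any of the three options is active".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class LongestFirstSketch {
  /**
   * hyp holds assumed hyphenation offsets, including 0 and token.length().
   * restrict stands in for "any of the three options is active".
   */
  static List<String> subwords(String token, int[] hyp, Set<String> dict, boolean restrict) {
    List<String> out = new ArrayList<>();
    String lower = token.toLowerCase(Locale.ROOT);
    // Early return: the token itself is a dictionary word, so add no subwords.
    if (restrict && dict.contains(lower)) {
      return out;
    }
    for (int i = 0; i < hyp.length - 1; i++) {
      // Inner loop runs from the longest candidate down to the shortest.
      for (int j = hyp.length - 1; j > i; j--) {
        String part = lower.substring(hyp[i], hyp[j]);
        if (dict.contains(part)) {
          out.add(part);
          if (restrict) {
            break; // the first hit is the longest one for this start point
          }
        }
      }
    }
    return out;
  }
}
```

For the example from the issue description (dictionary {{gesellschaft}}, {{schaft}}, assumed hyphenation offsets 0, 2, 6, 12), the sketch returns an empty list when {{restrict}} is true, and {{gesellschaft}} followed by {{schaft}} when it is false.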

h3. Tests

I added two test methods to {{TestCompoundWordTokenFilter}}:

1. {{#testNoSubAndNoOverlap()}} tests the expected behaviour of the 
{{noSubMatches}} and {{noOverlappingMatches}} options
2. {{#testNoSubAndTokenInDictionary()}} tests that no tokens for subwords are 
added if the token is part of the dictionary

In addition, {{TestHyphenationCompoundWordTokenFilterFactory#testLucene8183()}} 
asserts that the new configuration options are parsed.

h3. Environment

This patch is based on {{master}} from 
{{g...@github.com:apache/lucene-solr.git}}


> HyphenationCompoundWordTokenFilter creates overlapping tokens with 
> onlyLongestMatch enabled
> ---
>
> Key: LUCENE-8183
> URL: https://issues.apache.org/jira/browse/LUCENE-8183
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 6.6
> Environment: Configuration of the analyzer:
> 
> 
>          hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
>          dictionary="lang/wordlist_de.txt" 
>         onlyLongestMatch="true"/>
>  
>Reporter: Rupert Westenthaler
>Assignee: Uwe Schindler
>Priority: Major
> Attachments: LUCENE-8183_20180223_rwesten.diff, 
> LUCENE-8183_20180227_rwesten.diff, lucene-8183.zip
>
>
> The HyphenationCompoundWordTokenFilter creates overlapping tokens even if 
> onlyLongestMatch is enabled. 
> Example:
> Dictionary: {{gesellschaft}}, {{schaft}}
>  Hyphenator: {{de_DR.xml}} //from Apache Offo
>  onlyLongestMatch: true
>  
> |text|gesellschaft|gesellschaft|schaft|
> |raw_bytes|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[67 65 73 65 6c 6c 73 63 68 
> 61 66 74]|[73 63 68 61 66 74]|
> |start|0|0|0|
> |end|12|12|12|
> |positionLength|1|1|1|
> |type|word|word|word|
> |position|1|1|1|
> IMHO this includes 2 unexpected tokens:
>  # the 2nd 'gesellschaft', as it duplicates the original token
>  # the 'schaft', as it is a sub-token of 'gesellschaft', which is present in the 
> dictionary
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

2018-02-26 Thread Rupert Westenthaler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16376887#comment-16376887
 ] 

Rupert Westenthaler commented on LUCENE-8183:
-

That's helpful indeed, thanks!




[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

2018-02-26 Thread Matthias Krueger (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16376826#comment-16376826
 ] 

Matthias Krueger commented on LUCENE-8183:
--

You might want to have a look at "mocking" the HyphenationTree (see my patch 
for LUCENE-8185) which simplifies writing a decompounding test.




[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

2018-02-26 Thread Rupert Westenthaler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16376748#comment-16376748
 ] 

Rupert Westenthaler commented on LUCENE-8183:
-

FYI: I plan to spend some time implementing a version of the 
DictionaryCompoundWordTokenFilter that adds options for

* {{noSub}}: no tokens are added that are completely enclosed by a longer 
token ({{fußballpumpe}}: {{fußball}}, {{ballpumpe}})
* {{noOverlap}}: no overlapping tokens ({{fußballpumpe}}: {{fußball}}, {{pumpe}})

IMO the simplest way is to first emit all tokens and later filter those based 
on the active options ({{onlyLongestMatch}}, {{noSub}}, {{noOverlap}}).
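
A rough sketch of the "emit everything, then filter" idea, written in plain Java rather than against the Lucene API. All names here are hypothetical. The greedy left-to-right strategy used for {{noOverlap}} is an assumption on my side; other tie-breaking rules are possible. The example uses the ASCII spelling {{fussballpumpe}} only to keep the sample independent of the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;

final class SubwordFilterSketch {
  /** A dictionary hit with its character offsets inside the original token. */
  static final class Match {
    final int start;
    final int end;
    final String text;

    Match(int start, int end, String text) {
      this.start = start;
      this.end = end;
      this.text = text;
    }

    int length() {
      return end - start;
    }
  }

  /** noSub: drop every match that is completely enclosed by a longer match. */
  static List<String> noSub(List<Match> all) {
    List<String> kept = new ArrayList<>();
    for (Match m : all) {
      boolean enclosed = false;
      for (Match o : all) {
        if (o.length() > m.length() && o.start <= m.start && m.end <= o.end) {
          enclosed = true;
          break;
        }
      }
      if (!enclosed) {
        kept.add(m.text);
      }
    }
    return kept;
  }

  /** noOverlap: greedy pass from left to right, preferring the longest match per start. */
  static List<String> noOverlap(List<Match> all) {
    List<Match> sorted = new ArrayList<>(all);
    sorted.sort((a, b) -> a.start != b.start
        ? Integer.compare(a.start, b.start)
        : Integer.compare(b.length(), a.length()));
    List<String> kept = new ArrayList<>();
    int lastEnd = 0;
    for (Match m : sorted) {
      if (m.start >= lastEnd) {
        kept.add(m.text);
        lastEnd = m.end;
      }
    }
    return kept;
  }
}
```

Given the five matches {{fuss}}, {{fussball}}, {{ball}}, {{ballpumpe}} and {{pumpe}}, {{noSub}} keeps {{fussball}} and {{ballpumpe}}, while {{noOverlap}} keeps {{fussball}} and {{pumpe}}, which corresponds to the two examples listed above.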


Regarding the test:

Providing good test examples is hard, as the current test cases are based on 
Danish and I do not speak this language.
Providing examples in German would be easy, but this would require a German 
hyphenator, and the file is licensed under the LaTeX Project Public License and 
can therefore not be included in the source.
Given suitable examples, implementing the actual test seems to be 
rather easy, as it can be done similarly to the existing test cases.




[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

2018-02-23 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375154#comment-16375154
 ] 

Uwe Schindler commented on LUCENE-8183:
---

I have to check the algorithm, but to get this patch into Lucene, the test 
cases need to be adapted to check this new behaviour.




[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

2018-02-23 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375147#comment-16375147
 ] 

Uwe Schindler commented on LUCENE-8183:
---

bq. I am aware of this possibility. In fact I do use the 
RemoveDuplicatesTokenFilter to remove those tokens. My point was just why they 
are added in the first place.

I think it's good to not add them in the first place. The change is quite 
simple, so it can be done here. And it does not really complicate the algorithm, 
as it's done in one separate place.




[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

2018-02-23 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375144#comment-16375144
 ] 

Uwe Schindler commented on LUCENE-8183:
---

[~rwesten]: I was not aware that this was my dictionary file! The names in your 
example did not look like the example listed here: 
https://github.com/uschindler/german-decompounder




[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

2018-02-23 Thread Rupert Westenthaler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374916#comment-16374916
 ] 

Rupert Westenthaler commented on LUCENE-8183:
-

I was not aware that this is the intended behaviour.

{quote}For "Fußballpumpe" and dictionary "Ball", "Ballpumpe", "Pumpe", "Fuß", 
"Fußball" you would get the tokens "Fußball" and "pumpe" but not "Ballpumpe" as 
"Ball" has already been considered part of Fußball. Also, not sure if your 
change also improves the situation for languages other than German.{quote}

That's a good point. Maybe one should still consider parts that are not enclosed 
by a token that was already decomposed. So for {{Fußballpumpe}}: {{ball}} 
would be ignored as {{Fußball}} is already present, but {{ballpumpe}} would 
still be added as a token. Finally, {{pumpe}} is ignored as {{ballpumpe}} is 
present.

This reminds me of {{ALL}}, {{NO_SUB}} and {{LONGEST_DOMINANT_RIGHT}} as 
supported by the [Solr Text 
Tagger|https://github.com/OpenSextant/SolrTextTagger#the-tagger-request-time-parameters-are].

{quote}
Perhaps these kind of adjustments should rather be done in a TokenFilter 
similar to RemoveDuplicatesTokenFilter instead of complicating the 
decompounding algorithm?
{quote}
I am aware of this possibility. In fact I do use the 
{{RemoveDuplicatesTokenFilter}} to remove those tokens. My point was just why 
they are added in the first place.




[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

2018-02-23 Thread Rupert Westenthaler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374888#comment-16374888
 ] 

Rupert Westenthaler commented on LUCENE-8183:
-

AFAIK I use exactly this dictionary and hyphenator config. I will provide a 
Solr core config that can be used to reproduce the described issue.




[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

2018-02-23 Thread Matthias Krueger (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374886#comment-16374886
 ] 

Matthias Krueger commented on LUCENE-8183:
--

[~rwesten] Quick question regarding your patch: What's the reasoning behind not 
decomposing terms that are part of the dictionary at all?

The {{onlyLongestMatch}} flag currently affects whether all matches or only the 
longest match should be returned *per* *start* character (in 
DictionaryCompoundWordTokenFilter) or *per* hyphenation *start* point (in 
HyphenationCompoundWordTokenFilter).

Example:
 Dictionary {{"Schaft", "Wirt", "Wirtschaft", "Wissen", "Wissenschaft"}} for 
input "Wirtschaftswissenschaft" will return the original input plus tokens 
"Wirtschaft", "schaft", "wissenschaft", "schaft" but not "Wirt" or "Wissen". 
"schaft" is still returned (even twice) because it's the longest token starting 
at the respective position.
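
To make the "longest match per start character" behaviour concrete, here is a small stand-alone sketch. It only mimics the dictionary-based variant as described above and is not the Lucene implementation: the class name, the parameter names and the chosen size limits are assumptions for illustration, and the original input token (which the real filter also keeps) is not part of the returned list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class LongestPerStartSketch {
  /** Returns, for every start offset, the longest dictionary word beginning there. */
  static List<String> longestPerStart(String input, Set<String> dict, int minSize, int maxSize) {
    String lower = input.toLowerCase(Locale.ROOT);
    List<String> out = new ArrayList<>();
    for (int start = 0; start <= lower.length() - minSize; start++) {
      String longest = null;
      for (int len = minSize; len <= maxSize && start + len <= lower.length(); len++) {
        String candidate = lower.substring(start, start + len);
        if (dict.contains(candidate)) {
          longest = candidate; // len only grows, so a later hit is always longer
        }
      }
      if (longest != null) {
        out.add(longest);
      }
    }
    return out;
  }
}
```

Called with "Wirtschaftswissenschaft", the lower-cased dictionary from the example, a minimum size of 2 and a maximum size of 15, the sketch yields {{wirtschaft}}, {{schaft}}, {{wissenschaft}}, {{schaft}}: "Wirt" and "Wissen" are suppressed, but "schaft" still appears twice.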

I like the idea of restricting this further to only the longest terms that 
*touch* a certain hyphenation point. This would exclude "schaft" in the example 
above (as "Wirtschaft" and "wissenschaft" are two longer terms encompassing the 
respective hyphenation point). On the other hand, there might be examples where 
you still want to include the "overlapping" tokens. For "Fußballpumpe" and 
dictionary {{"Ball", "Ballpumpe", "Pumpe", "Fuß", "Fußball"}} you would get the 
tokens "Fußball" and "pumpe" but not "Ballpumpe" as "Ball" has already been 
considered part of Fußball. Also, not sure if your change also improves the 
situation for languages other than German.

Regarding point 1: The current algorithm always returns the term itself again 
if it's part of the dictionary. I guess this could be changed if we don't 
check against {{this.maxSubwordSize}} but against 
{{Math.min(this.maxSubwordSize, termAtt.length()-1)}}.
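
As a tiny illustration of that bound (the helper class and method are hypothetical, only the {{Math.min}} expression comes from the suggestion above): capping the subword length at one character less than the term length means the full term can never be produced as its own subword.

```java
final class SubwordCapSketch {
  /** Upper bound for candidate subword lengths; always shorter than the term itself. */
  static int cap(int maxSubwordSize, int termLength) {
    return Math.min(maxSubwordSize, termLength - 1);
  }
}
```

For a 12-character term such as "gesellschaft" and a maximum subword size of 15, the cap is 11, so the term itself is excluded while "schaft" (6 characters) remains a valid candidate.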

Perhaps these kind of adjustments should rather be done in a TokenFilter 
similar to RemoveDuplicatesTokenFilter instead of complicating the 
decompounding algorithm?




[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

2018-02-23 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374753#comment-16374753
 ] 

Uwe Schindler commented on LUCENE-8183:
---

Hi,
I have not noticed this with my dictionaries, I have to dig into that. Did you 
also check the German dictionary which is provided here: 
https://github.com/uschindler/german-decompounder




[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

2018-02-23 Thread Rupert Westenthaler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374237#comment-16374237
 ] 

Rupert Westenthaler commented on LUCENE-8183:
-

Added a patch that shows how the unexpected behaviour could be fixed.
