[ https://issues.apache.org/jira/browse/LUCENE-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15948882#comment-15948882 ]
Mikhail Bystryantsev edited comment on LUCENE-7758 at 3/30/17 11:30 AM:
------------------------------------------------------------------------

{quote}Now imagine that someone is applying edge n-grams on top of synonyms, this could generate broken offsets (going backwards for instance) so keeping the original offsets is the only safe move{quote}

But why should one feature break another? I don't use synonyms or anything like that, yet I have no way to use the token filter with proper offsets.

{quote}A workaround to this issue is to use the (edge) n-gram tokenizers (as opposed to filters){quote}

That workaround applies only when the input text can simply be split on specified characters. In my case I want to use {{icu_tokenizer}} before {{edge_ngram}} so that the text is properly split into words. Consider Japanese text, for example.
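For readers unfamiliar with the setup being described, here is a minimal sketch of index settings that chain {{icu_tokenizer}} with an {{edge_ngram}} token filter. It is an illustration only, not the reporter's actual configuration: the names `icu_edge_analyzer` and `prefix_grams` and the gram sizes are hypothetical, and `icu_tokenizer` is assumed to come from the Elasticsearch analysis-icu plugin.

```python
import json

# Hypothetical index settings: the analyzer and filter names and the gram
# sizes are assumptions chosen for illustration, not taken from the issue.
settings = {
    "analysis": {
        "filter": {
            "prefix_grams": {
                "type": "edge_ngram",  # token filter, applied after tokenization
                "min_gram": 1,
                "max_gram": 10,
            }
        },
        "analyzer": {
            "icu_edge_analyzer": {
                "type": "custom",
                # Word segmentation is done by the ICU tokenizer (assumes the
                # analysis-icu plugin is installed), so no delimiter list is needed.
                "tokenizer": "icu_tokenizer",
                "filter": ["lowercase", "prefix_grams"],
            }
        },
    }
}

# Body that would be sent when creating an index.
request_body = json.dumps({"settings": settings}, indent=2)
print(request_body)
```

With a chain like this, word boundaries come from the tokenizer and the prefixes come from the filter, which is why the offsets assigned by the filter matter for highlighting.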
> EdgeNGramTokenFilter breaks highlighting by keeping end offsets of original tokens
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-7758
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7758
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.4.1
>         Environment: elasticsearch-5.3
>            Reporter: Mikhail Bystryantsev
>              Labels: EdgeNGramTokenFilter, highlighting
>
> When EdgeNGramTokenFilter produces new tokens, they inherit their end offsets
> from the parent tokens. This behaviour is irrational and breaks highlighting:
> the whole source token is highlighted instead of the matched pattern.
> A similar problem seems to have been fixed in LUCENE-3642, but end offsets were
> broken again after LUCENE-3907.
> Some discussion can be found in SOLR-7926:
> {quote}I agree this (highlighting of hits from tokens produced by
> EdgeNGramFilter) got worse with LUCENE-3907, but it's not clear how to
> fix it.
> The stacking seems more correct: all these grams are logically
> interchangeable with the original token, and were derived from it, so
> e.g. a phrase query involving them with adjacent tokens would work
> correctly.
> We could perhaps remove the token graph requirement that tokens
> leaving from the same node have the same startOffset, and arriving to
> the same node have the same endOffset. Lucene would still be able to
> index such a graph, as long as all tokens leaving a given node are
> sorted according to their startOffset. But I'm not sure if there
> would be other problems...
> Or we could maybe improve the token graph, at least for the non-edge
> NGramTokenFilter, so that the grams are linked up correctly, so that any
> path through the graph reconstructs the original characters.
> But realistically it's not possible to innovate much with token graphs
> in Lucene today because of apparently severe back compat requirements:
> e.g. LUCENE-6664, which fixes the token graph bugs in the existing
> SynonymFilter so that proximity queries work correctly when using
> search-time synonyms, is blocked because of the back compat concerns
> from LUCENE-6721.
> I'm not sure what the path forward is...{quote}
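
The effect described in the issue can be reproduced with a small simulation. The sketch below does not use Lucene: `Token`, `whitespace_tokens`, `edge_ngrams` and `highlight` are hypothetical helpers, and the offset handling is a simplified model of the two behaviours under discussion (grams inheriting the parent token's end offset versus grams carrying their own end offset).

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    start: int  # start offset into the original text
    end: int    # end offset into the original text (exclusive)


def whitespace_tokens(text: str) -> List[Token]:
    """Split on whitespace and record each word's character offsets."""
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append(Token(word, start, end))
        pos = end
    return tokens


def edge_ngrams(token: Token, min_gram: int, max_gram: int,
                keep_parent_offsets: bool) -> List[Token]:
    """Emit prefixes of a token.

    keep_parent_offsets=True models the reported behaviour (every gram keeps
    the parent's end offset); False gives each gram its own end offset.
    """
    grams = []
    for n in range(min_gram, min(max_gram, len(token.text)) + 1):
        end = token.end if keep_parent_offsets else token.start + n
        grams.append(Token(token.text[:n], token.start, end))
    return grams


def highlight(text: str, query: str, keep_parent_offsets: bool) -> str:
    """Wrap the span covered by the first gram equal to the query in <em> tags."""
    for token in whitespace_tokens(text):
        for gram in edge_ngrams(token, 1, 10, keep_parent_offsets):
            if gram.text == query:
                return (text[:gram.start] + "<em>" + text[gram.start:gram.end]
                        + "</em>" + text[gram.end:])
    return text


inherited = highlight("quick brown fox", "bro", keep_parent_offsets=True)
per_gram = highlight("quick brown fox", "bro", keep_parent_offsets=False)
print(inherited)  # quick <em>brown</em> fox
print(per_gram)   # quick <em>bro</em>wn fox
```

In this simplified model the inherited end offset makes the highlighter mark the whole word "brown" for the query "bro", which is the symptom reported above. Per-gram end offsets mark only the matched prefix, although the discussion quoted from SOLR-7926 explains why changing the offsets is not straightforward once token graphs and synonyms are involved.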