[jira] [Comment Edited] (LUCENE-7758) EdgeNGramTokenFilter breaks highlighting by keeping end offsets of original tokens

Mikhail Bystryantsev (JIRA) Thu, 30 Mar 2017 04:32:08 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15948882#comment-15948882
 ]


Mikhail Bystryantsev edited comment on LUCENE-7758 at 3/30/17 11:30 AM:
------------------------------------------------------------------------

{quote}Now imagine that someone is applying edge n-grams on top of synonyms, 
this could generate broken offsets (going backwards for instance) so keeping 
the original offsets is the only safe move{quote}
But why should one feature break another? I don't use synonyms or anything 
similar, yet I have no way to use the token filter with correct offsets.
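
To make the quoted concern concrete, here is a minimal sketch in plain Python. It does not use Lucene, and it is not necessarily the exact failure the quoted comment has in mind. The {{Token}} type, the {{naive_edge_ngrams}} helper and the "dns" -> "nameserver" synonym are all hypothetical, chosen only for illustration. It assumes a filter that derives each gram's end offset from the term length, applied to a synonym that inherits the offsets of the word it replaces:

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str
    start: int  # start offset in the source text (inclusive)
    end: int    # end offset in the source text (exclusive)


def naive_edge_ngrams(token: Token, min_gram: int, max_gram: int) -> List[Token]:
    """Hypothetical filter: each gram's end offset is start + gram length."""
    longest = min(max_gram, len(token.term))
    return [
        Token(token.term[:n], token.start, token.start + n)
        for n in range(min_gram, longest + 1)
    ]


text = "dns ok"

# Assumed synonym for "dns": it is longer than the word it replaces,
# but it inherits that word's offsets (0..3).
synonym = Token("nameserver", 0, 3)

grams = naive_edge_ngrams(synonym, 1, 10)

# Grams whose end offset leaves the span of the original word.
outside_word = [g for g in grams if g.end > synonym.end]

# Grams whose end offset points past the end of the whole text.
outside_text = [g for g in grams if g.end > len(text)]

# The offsets of the gram "name" select unrelated source characters.
name_gram = grams[3]
print(name_gram, repr(text[name_gram.start:name_gram.end]))
```

In this sketch the offsets stop describing the source text as soon as the term is longer than the span it came from. Without synonyms or other term-rewriting filters in the chain, the term length and the offset span agree and this particular failure cannot occur.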

{quote}A workaround to this issue is to use the (edge) n-gram tokenizers (as 
opposed to filters){quote}
That workaround applies only when the input text can simply be split on 
specified characters. In my case I want to use {{icu_tokenizer}} before 
{{edge_ngram}} to split the text into words properly. Consider Japanese, for 
example, where words are not separated by spaces.


> EdgeNGramTokenFilter breaks highlighting by keeping end offsets of original 
> tokens
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-7758
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7758
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.4.1
>         Environment: elasticsearch-5.3
>            Reporter: Mikhail Bystryantsev
>              Labels: EdgeNGramTokenFilter, highlighting
>
> When EdgeNGramTokenFilter produces new tokens, they inherit the end offsets 
> of their parent tokens. This behaviour is counterintuitive and breaks 
> highlighting: the whole source token is highlighted instead of only the 
> matched part.
> A similar problem seems to have been fixed in LUCENE-3642, but end offsets 
> were broken again after LUCENE-3907.
> There is some related discussion in SOLR-7926:
> {quote}I agree this (highlighting of hits from tokens produced by
> EdgeNGramFilter) got worse with LUCENE-3907, but it's not clear how to
> fix it.
> The stacking seems more correct: all these grams are logically
> interchangeable with the original token, and were derived from it, so
> e.g. a phrase query involving them with adjacent tokens would work
> correctly.
> We could perhaps remove the token graph requirement that tokens
> leaving from the same node have the same startOffset, and arriving to
> the same node have the same endOffset. Lucene would still be able to
> index such a graph, as long as all tokens leaving a given node are
> sorted according to their startOffset. But I'm not sure if there
> would be other problems...
> Or we could maybe improve the token graph, at least for the non-edge
> NGramTokenFilter, so that the grams are linked up correctly, so that any
> path through the graph reconstructs the original characters.
> But realistically it's not possible to innovate much with token graphs
> in Lucene today because of apparently severe back compat requirements:
> e.g. LUCENE-6664, which fixes the token graph bugs in the existing
> SynonymFilter so that proximity queries work correctly when using
> search-time synonyms, is blocked because of the back compat concerns
> from LUCENE-6721.
> I'm not sure what the path forward is...{quote}
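
The highlighting effect described in the issue can be sketched with a small plain-Python simulation. It does not call Lucene. The {{Token}} type, the helper functions and the {{inherit_offsets}} switch are hypothetical names used only for illustration, and the whitespace split stands in for a real tokenizer. With inherited offsets the highlighter can only mark the whole parent token; with per-gram offsets it marks just the matched prefix:

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str
    start: int  # start offset in the source text (inclusive)
    end: int    # end offset in the source text (exclusive)


def whitespace_tokens(text: str) -> List[Token]:
    """Split on single spaces, recording character offsets for each word."""
    tokens, pos = [], 0
    for word in text.split(" "):
        if word:
            tokens.append(Token(word, pos, pos + len(word)))
        pos += len(word) + 1  # step over the word and the following space
    return tokens


def edge_ngrams(token: Token, min_gram: int, max_gram: int,
                inherit_offsets: bool) -> List[Token]:
    """Emit prefixes of the token, with inherited or per-gram end offsets."""
    grams = []
    for n in range(min_gram, min(max_gram, len(token.term)) + 1):
        end = token.end if inherit_offsets else token.start + n
        grams.append(Token(token.term[:n], token.start, end))
    return grams


def highlight(text: str, query: str, inherit_offsets: bool) -> str:
    """Wrap the offsets of the first gram equal to the query in <em> tags."""
    for token in whitespace_tokens(text):
        for gram in edge_ngrams(token, 1, 10, inherit_offsets):
            if gram.term == query:
                return (text[:gram.start] + "<em>"
                        + text[gram.start:gram.end] + "</em>"
                        + text[gram.end:])
    return text


print(highlight("quick brown fox", "bro", inherit_offsets=True))
# quick <em>brown</em> fox
print(highlight("quick brown fox", "bro", inherit_offsets=False))
# quick <em>bro</em>wn fox
```

The first call mirrors the behaviour reported here; the second shows what the reporter expects from the filter.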



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
