[jira] [Commented] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

2018-12-04 Thread ASF subversion and git services (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708502#comment-16708502 ]

ASF subversion and git services commented on LUCENE-8509:
---------------------------------------------------------

Commit 75a053dd696d6e632755e613380450f22c78c91b in lucene-solr's branch 
refs/heads/master from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=75a053d ]

LUCENE-8509: WordDelimiterGraphFilter no longer adjusts offsets by default


> NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can 
> produce backwards offsets
> 
>
> Key: LUCENE-8509
> URL: https://issues.apache.org/jira/browse/LUCENE-8509
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8509.patch, LUCENE-8509.patch
>
>
> Discovered by an elasticsearch user and described here: 
> https://github.com/elastic/elasticsearch/issues/33710
> The ngram tokenizer produces tokens "a b" and " bb" (note the space at the 
> beginning of the second token).  The WDGF takes the first token and splits it 
> into two, adjusting the offsets of the second token, so we get "a"[0,1] and 
> "b"[2,3].  The trim filter removes the leading space from the second token, 
> leaving offsets unchanged, so WDGF sees "bb"[1,4]; because the leading space 
> has already been stripped, WDGF sees no need to adjust offsets, and emits the 
> token as-is, resulting in the start offsets of the tokenstream being [0, 2, 
> 1], and the IndexWriter rejecting it.
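The failing chain is easiest to see in a small simulation. The following is a self-contained Python sketch (it does not use Lucene; the three functions only model the offset bookkeeping described above) that reproduces the backwards start offsets [0, 2, 1]:

```python
# Toy model (not Lucene code) of the offset bookkeeping described above.
# Tokens are (text, start_offset, end_offset) triples over the input "a bb".

def ngram_tokenize(text, n=3):
    # Like NGramTokenizer: emit every n-gram with its source offsets.
    return [(text[i:i + n], i, i + n) for i in range(len(text) - n + 1)]

def trim_filter(tokens):
    # Like TrimFilter: strip surrounding whitespace, offsets unchanged.
    return [(t.strip(), s, e) for (t, s, e) in tokens]

def wdgf(tokens):
    # Like WDGF's old behavior: split on spaces and adjust subtoken
    # offsets; a token with no delimiter left is emitted as-is.
    out = []
    for (t, s, e) in tokens:
        parts = t.split(" ")
        if len(parts) == 1:
            out.append((t, s, e))  # nothing to split: offsets kept
            continue
        pos = 0
        for part in parts:
            if part:
                out.append((part, s + pos, s + pos + len(part)))
            pos += len(part) + 1
    return out

tokens = wdgf(trim_filter(ngram_tokenize("a bb")))
print([s for (_, s, _) in tokens])  # start offsets go backwards: [0, 2, 1]
```

The tokenizer emits "a b"[0,3] and " bb"[1,4]; trimming leaves "bb"[1,4]; WDGF splits only the first token, yielding starts 0, 2, 1 — exactly the non-monotonic sequence IndexWriter rejects.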



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

2018-12-03 Thread Alan Woodward (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707100#comment-16707100 ]

Alan Woodward commented on LUCENE-8509:
---------------------------------------

I plan on committing this in the next couple of days.




[jira] [Commented] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

2018-11-19 Thread Alan Woodward (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691884#comment-16691884 ]

Alan Woodward commented on LUCENE-8509:
---------------------------------------

Here's an updated patch that allows you to conditionally disable WDGF's 
offset-mangling, defaulting to "no mangling".  This should also allow us to 
plug it back into TestRandomChains, while disallowing the constructor that 
permits turning the mangling back on again.




[jira] [Commented] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

2018-10-29 Thread Michael Gibney (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667406#comment-16667406 ]

Michael Gibney commented on LUCENE-8509:
----------------------------------------

I'd echo [~dsmiley]'s comment over at LUCENE-8516 – "I don't see the big deal 
in a token filter doing tokenization. I see it has certain challenges but don't 
think it's fundamentally wrong".

A special case of the "not-so-crazy" idea proposed above would have WDGF remain 
a {{TokenFilter}}, but require it to be configured to take input directly from 
a {{Tokenizer}} (as opposed to more general {{TokenStream}}). I think this 
would be functionally equivalent to the change proposed at LUCENE-8516. This 
special case would obviate the need for tracking whether there exists a 1:1 
correspondence between input offsets and token text, because such 
correspondence should (?) always exist immediately after the {{Tokenizer}}. 
This approach (or the slightly more general/elaborate "not-so-crazy" approach 
described above) might also address [~rcmuir]'s observation at LUCENE-8516 that 
the {{WordDelimiterTokenizer}} could be viewed as "still a tokenfilter in 
disguise".

As a side note, the configuration referenced in the title and description of 
this issue doesn't illustrate the more general problem particularly well, 
because the problem with this configuration could equally well be addressed by 
having {{TrimFilter}} update offsets, or (I think with no effect on intended 
behavior) by simply reordering the filters so that {{TrimFilter}} comes after 
WDGF.




[jira] [Commented] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

2018-10-28 Thread Alan Woodward (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1403#comment-1403 ]

Alan Woodward commented on LUCENE-8509:
---------------------------------------

> WDGF is playing the role of a tokenizer

This is the root of all its problems though, really.  It ought to be a 
tokenizer: token filters shouldn't be changing offsets at all, because it's too 
easy to end up with offsets going backwards.  I have a separate issue 
(LUCENE-8516) to make WDF a Tokenizer, but it's going to be a much more 
complicated change, and I think this is a reasonable short-term solution.




[jira] [Commented] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

2018-10-26 Thread Mike Sokolov (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16665429#comment-16665429 ]

Mike Sokolov commented on LUCENE-8509:
--------------------------------------

[ from mailing list – sorry for the duplication ]

The current situation is that it is impossible to apply offsets correctly in a 
TokenFilter. It seems to work OK most of the time, but truly correct behavior 
relies on prior components in the chain not having altered the length of 
tokens, which some of them occasionally do. For complete correctness in this 
area, I believe there are only really two possibilities: one is to stop trying 
to provide offsets in token filters, as in this issue, and the other would be 
to add some mechanism for allowing token filters to access the "correct" 
offset.  Well I guess we could try to prevent token filters from adding or 
removing characters, but that seems like a nonstarter for a lot of reasons. I 
put up a patch that allows for correct offsetting, but I think there was some 
consensus, and I am coming around to this position, that the amount of API 
change was not justified by the pretty minor benefit of having accurate 
within-token highlighting.

So I am +1 to this patch.




Re: [jira] [Commented] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

2018-10-26 Thread Michael Gibney
Ah, I see -- thanks, Michael. To make sure I understand correctly, this
particular case (with this particular order of analysis components) *would*
in fact be fixed by causing TrimFilter to update offsets. But for the sake
of argument, if we had some filter *before* TrimFilter that for some reason
*added* an extra leading space, then TrimFilter would have no way of
knowing whether to update the startOffset by +1 (correct) or +2 (incorrect,
but probably the most likely way to implement). Or a less contrived
example: if you applied SynonymGraphFilter before WDGF (which would seem
weird, but could happen) that would break all correspondence between the
token text and the input offsets, and *any* manipulation of offsets by WDGF
would be based on the false assumption of such a correspondence.

I think that makes me also +1 for Alan's suggestion.

While we're at it though, thinking ahead a little more about "figure out
how to do it correctly", I can think of only 2 possibilities, each
requiring an extra Attribute, and one of the possibilities is crazy:

The crazy idea: have an Attribute that maps each input character offset to
a corresponding character position in the token text ... but actually I
don't think that would even work, so never mind.

The not-so-crazy idea: have a boolean Attribute that tracks whether there
is a 1:1 correspondence between the input offsets and the token text. Any
TokenFilter doing the kind of manipulation that *would* affect offsets
could check for the presence of this Attribute (which would default to
false), and iff present and true, could update offsets. I think that should
be robust, and could leave the behavior of a lot of existing configurations
unchanged (since TrimFilter, WDGF, and the like are often applied early in
the analysis chain); this would also avoid the need to potentially modify
tests for highlighting, etc...

Michael
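The "not-so-crazy" idea above can be sketched roughly as follows. This is a hypothetical illustration, not Lucene API: the `Token` class, the `offsets_reliable` flag, and both filter functions are invented names standing in for a real Attribute-based implementation.

```python
# Illustrative model of the proposed boolean Attribute. All names here
# are hypothetical, not Lucene API.
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    start: int
    end: int
    offsets_reliable: bool = True  # 1:1 text/offset correspondence?

def synonym_filter(tok, replacement):
    # Replacing token text breaks the correspondence, so clear the flag.
    return Token(replacement, tok.start, tok.end, offsets_reliable=False)

def wdgf_split(tok):
    # Adjust subtoken offsets only when the flag says it is safe;
    # otherwise fall back to inheriting the parent token's offsets.
    out, pos = [], 0
    for part in tok.text.split(" "):
        if part and tok.offsets_reliable:
            out.append(Token(part, tok.start + pos, tok.start + pos + len(part)))
        elif part:
            out.append(Token(part, tok.start, tok.end, offsets_reliable=False))
        pos += len(part) + 1
    return out

print([(t.text, t.start, t.end) for t in wdgf_split(Token("a b", 0, 3))])
# [('a', 0, 1), ('b', 2, 3)]
```

With the flag intact, WDGF can split "a b"[0,3] into "a"[0,1] and "b"[2,3]; after a synonym-style substitution clears the flag, subtokens would instead inherit the parent's [0,3] span, never producing backwards offsets.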


Re: [jira] [Commented] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

2018-10-26 Thread Michael Sokolov
In case it wasn't clear, I am +1 for Alan's plan. We can always restore
offset-alterations here if at some future date we figure out how to do it
correctly.




[jira] [Commented] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

2018-10-24 Thread Michael Gibney (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663115#comment-16663115 ]

Michael Gibney commented on LUCENE-8509:
----------------------------------------

> The trim filter removes the leading space from the second token, leaving 
> offsets unchanged, so WDGF sees "bb"[1,4]; 

If I understand correctly what [~dsmiley] is saying, then to put it another 
way: doesn't this look more like an issue with {{TrimFilter}}? If WDGF sees as 
input from {{TrimFilter}} "bb"[1,4] (instead of " bb"[1,4] or "bb"[2,4]), then 
it's handling the input correctly, but the input is wrong.

"because tokenization splits offsets and WDGF is playing the role of a 
tokenizer" -- this behavior is notably different from what 
{{SynonymGraphFilter}} does (adding externally-specified alternate 
representations of input tokens). Offsets are really only meaningful with 
respect to input, and new tokens introduced by WDGF are directly derived from 
input, while new tokens introduced by {{SynonymGraphFilter}} are not and thus 
can _only_ inherit offsets of the input token.




[jira] [Commented] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

2018-10-24 Thread David Smiley (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662897#comment-16662897 ]

David Smiley commented on LUCENE-8509:
--------------------------------------

bq. The trim filter removes the leading space from the second token, leaving 
offsets unchanged

That sounds fishy though; shouldn't they be trivially adjusted?

I'm skeptical that your proposed change to WDGF is an improvement, because 
tokenization splits offsets and WDGF is playing the role of a tokenizer.  
Perhaps your proposal could be a new option, perhaps even defaulting the way 
you want, and we could solicit feedback saying that the ability to toggle it 
may go away.  The option's default setting should probably be Version-dependent.




[jira] [Commented] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

2018-10-24 Thread Alan Woodward (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662159#comment-16662159 ]

Alan Woodward commented on LUCENE-8509:
---------------------------------------

Here is a patch removing the offset-adjustment logic from WDGF.  All subtokens 
emitted by the filter now have the same offsets as their parent token.

The downstream consequence is that entire tokens will be highlighted (e.g., if 
you search for 'wi' then the whole token 'wi-fi' will get highlighted).  I 
think this is a reasonable trade-off, though.  It also brings things more into 
line with the behaviour of SynonymGraphFilter.
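The new behavior can be sketched in a couple of lines. This is a hedged toy model (not the actual patch, and not Lucene code): subtokens simply reuse the parent token's offsets, so start offsets can never go backwards.

```python
# Sketch of the new default: every subtoken WDGF emits keeps the
# offsets of the token it was split from, unadjusted.

def wdgf_split(text, start, end):
    # '-' stands in for WDGF's delimiter rules; subtoken offsets are
    # inherited from the parent rather than recomputed.
    return [(part, start, end) for part in text.split("-") if part]

print(wdgf_split("wi-fi", 0, 5))  # [('wi', 0, 5), ('fi', 0, 5)]
```

A match on 'wi' therefore highlights the whole source span 'wi-fi', which is the trade-off described above.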




[jira] [Commented] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

2018-10-24 Thread Alan Woodward (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662137#comment-16662137 ]

Alan Woodward commented on LUCENE-8509:
---------------------------------------

A related case: https://github.com/elastic/elasticsearch/issues/34741

I think we should just change WordDelimiterGraphFilter so that it no longer 
adjusts offsets for its parts.




[jira] [Commented] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

2018-09-19 Thread Alan Woodward (JIRA)


[ https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620596#comment-16620596 ]

Alan Woodward commented on LUCENE-8509:
---------------------------------------

I don't think this is fixable with the current setup, but it's another argument 
for making WordDelimiterGraphFilter a tokenizer.
