subject:"\[jira\] \[Commented\] \(LUCENE\-4065\) FilteringTokenFilter should never corrupt the tokenstream graph"

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

2018-02-22 Thread Steve Rowe (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373774#comment-16373774
 ] 

Steve Rowe commented on LUCENE-4065:


bq. But your testcase argues that this should be 4 positions (1 + 2 + 1). I'm 
just not convinced thats the correct behavior: its unintuitive to me that a 
stopfilter would make a document "longer" in the sense of actually adding 
additional positions... (no, it doesn't impact length normalization because 
this value isn't used for that, but its just really confusing).

Yeah I think my testcase is just wrong - StopFilter's behavior actually looks 
correct here to me (testcase can be fixed by changing the "walking" posinc  
{{2}} -> {{1}}, which no longer fails).  I guess I misinterpreted Jim's example 
as graph corruption, which this doesn't look like to me.

> FilteringTokenFilter should never corrupt the tokenstream graph
> ---
>
> Key: LUCENE-4065
> URL: https://issues.apache.org/jira/browse/LUCENE-4065
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-4065_test.patch
>
>
> Currently removers like stopfilter have an option (true/false) to enable 
> position increments.
> If its true: it both inserts gaps where necessary AND propagates gaps down 
> the stream.
> If its false: it does neither, which can totally mess up the tokenstream 
> graph (e.g. move synonyms to another word).
> There are totally valid natural usecases for false, where you don't want gaps 
> because you want phrasequeries to act as if the word was never actually there.
> But 'not inserting gaps' is separate from proper propagation of existing gaps.
> So I think we should provide an option (either fix 'false' or make it an 
> enum), where you still get a legit tokenstream and dont totally screw it up, 
> but you simply omit gaps.
> See LUCENE-3848 for more information (Where we at least fixed this case to 
> not begin the tokenstream with posinc=0)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

2018-02-22 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373757#comment-16373757
 ] 

Robert Muir commented on LUCENE-4065:
-

Yeah, i mean we should split it up. Its probably more important to figure out 
what this thing should be doing in the graph case (positionLengths). Because 
before the stopfilter i've got a total of 3 positions (1 + 0 + 1 + 1). Today 
the stopfilter deletes "the" and transfers the position to "twd", so i've still 
got 3 positions (1 + 1 + 1). 

But your testcase argues that this should be 4 positions (1 + 2 + 1). I'm just 
not convinced thats the correct behavior: its unintuitive to me that a 
stopfilter would make a document "longer" in the sense of actually adding 
additional positions... (no, it doesn't impact length normalization because 
this value isn't used for that, but its just really confusing).

> FilteringTokenFilter should never corrupt the tokenstream graph
> ---
>
> Key: LUCENE-4065
> URL: https://issues.apache.org/jira/browse/LUCENE-4065
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-4065_test.patch
>
>
> Currently removers like stopfilter have an option (true/false) to enable 
> position increments.
> If its true: it both inserts gaps where necessary AND propagates gaps down 
> the stream.
> If its false: it does neither, which can totally mess up the tokenstream 
> graph (e.g. move synonyms to another word).
> There are totally valid natural usecases for false, where you don't want gaps 
> because you want phrasequeries to act as if the word was never actually there.
> But 'not inserting gaps' is separate from proper propagation of existing gaps.
> So I think we should provide an option (either fix 'false' or make it an 
> enum), where you still get a legit tokenstream and dont totally screw it up, 
> but you simply omit gaps.
> See LUCENE-3848 for more information (Where we at least fixed this case to 
> not begin the tokenstream with posinc=0)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

2018-02-22 Thread Steve Rowe (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373746#comment-16373746
 ] 

Steve Rowe commented on LUCENE-4065:


bq.  i think its totally separate from the whole idea of giving the user a gaps 
option.

Maybe change the issue title to that then?  (I was lumping it here because it's 
a form of token stream corruption)



> FilteringTokenFilter should never corrupt the tokenstream graph
> ---
>
> Key: LUCENE-4065
> URL: https://issues.apache.org/jira/browse/LUCENE-4065
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-4065_test.patch
>
>
> Currently removers like stopfilter have an option (true/false) to enable 
> position increments.
> If its true: it both inserts gaps where necessary AND propagates gaps down 
> the stream.
> If its false: it does neither, which can totally mess up the tokenstream 
> graph (e.g. move synonyms to another word).
> There are totally valid natural usecases for false, where you don't want gaps 
> because you want phrasequeries to act as if the word was never actually there.
> But 'not inserting gaps' is separate from proper propagation of existing gaps.
> So I think we should provide an option (either fix 'false' or make it an 
> enum), where you still get a legit tokenstream and dont totally screw it up, 
> but you simply omit gaps.
> See LUCENE-3848 for more information (Where we at least fixed this case to 
> not begin the tokenstream with posinc=0)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

2018-02-22 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373727#comment-16373727
 ] 

Robert Muir commented on LUCENE-4065:
-

Well, i think thats a separate, "new" issue, related to positionLength. Keep in 
mind this JIRA issue was open before positionLength even existed at all. i 
think its totally separate from the whole idea of giving the user a gaps option.

This is what the thing looks like before stopfilter sees it. When you delete 
"the", it just transfers the 1 to twd. It doesn't currently look at 
positionlength at all.

{noformat}
SynonymGraphFilter->term=the,positionIncrement=1,positionLength=1,type=SYNONYM,termFrequency=1
SynonymGraphFilter->term=twd,positionIncrement=0,positionLength=3,type=word,termFrequency=1
SynonymGraphFilter->term=walking,positionIncrement=1,positionLength=1,type=SYNONYM,termFrequency=1
SynonymGraphFilter->term=dead,positionIncrement=1,positionLength=1,type=SYNONYM,termFrequency=1
{noformat}

To be honest, its unclear if stopfilter is really the culprit. Its definitely 
funky the way that SynonymGraphFilter makes the original word "twd" a "synonym" 
(posInc=0)... i think if it didn't do that, you wouldn't have that problem in 
this case. But i don't know if its a general solution to your problem.





> FilteringTokenFilter should never corrupt the tokenstream graph
> ---
>
> Key: LUCENE-4065
> URL: https://issues.apache.org/jira/browse/LUCENE-4065
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-4065_test.patch
>
>
> Currently removers like stopfilter have an option (true/false) to enable 
> position increments.
> If its true: it both inserts gaps where necessary AND propagates gaps down 
> the stream.
> If its false: it does neither, which can totally mess up the tokenstream 
> graph (e.g. move synonyms to another word).
> There are totally valid natural usecases for false, where you don't want gaps 
> because you want phrasequeries to act as if the word was never actually there.
> But 'not inserting gaps' is separate from proper propagation of existing gaps.
> So I think we should provide an option (either fix 'false' or make it an 
> enum), where you still get a legit tokenstream and dont totally screw it up, 
> but you simply omit gaps.
> See LUCENE-3848 for more information (Where we at least fixed this case to 
> not begin the tokenstream with posinc=0)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

2018-02-22 Thread Steve Rowe (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373703#comment-16373703
 ] 

Steve Rowe commented on LUCENE-4065:


Based on [Jim Ferenczi's comment on 
SOLR-11968|https://issues.apache.org/jira/browse/SOLR-11968?focusedCommentId=16373554=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16373554],
 I created a failing test for StopFilter that shows that StopFilter can (still) 
corrupt the token stream - the failure message says that "walking" gets a 
posinc of 1 instead of 2, which means that the only way to interpret the "twd" 
token's poslen of 3 is as a trailing gap, which is misplaced:

{code:java|title=TestStopFilterFactory.java}
  public void testLeadingStopwordSynonymGraph() throws Exception {
SynonymMap.Builder builder = new SynonymMap.Builder(true);
builder.add(new CharsRef("twd"), new 
CharsRef("the\uwalking\udead"), true);
final SynonymMap synonymMap = builder.build();

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
MockTokenizer tokenizer = new MockTokenizer();
TokenStream stream = new SynonymGraphFilter(tokenizer, synonymMap, 
true);
stream = new StopFilter(stream, 
CharArraySet.copy(Collections.singleton("the")));
return new TokenStreamComponents(tokenizer, stream);
  }
};
TokenStream tokenStream = analyzer.tokenStream("field", "twd");
assertTokenStreamContents(tokenStream, 
new String[] { "twd", "walking", "dead" },
null, null, 
new int[]{ 1, 2, 1  },  // posinc
new int[]{ 3, 1, 1  },  // poslen
null);
  }
{code}

> FilteringTokenFilter should never corrupt the tokenstream graph
> ---
>
> Key: LUCENE-4065
> URL: https://issues.apache.org/jira/browse/LUCENE-4065
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-4065_test.patch
>
>
> Currently removers like stopfilter have an option (true/false) to enable 
> position increments.
> If its true: it both inserts gaps where necessary AND propagates gaps down 
> the stream.
> If its false: it does neither, which can totally mess up the tokenstream 
> graph (e.g. move synonyms to another word).
> There are totally valid natural usecases for false, where you don't want gaps 
> because you want phrasequeries to act as if the word was never actually there.
> But 'not inserting gaps' is separate from proper propagation of existing gaps.
> So I think we should provide an option (either fix 'false' or make it an 
> enum), where you still get a legit tokenstream and dont totally screw it up, 
> but you simply omit gaps.
> See LUCENE-3848 for more information (Where we at least fixed this case to 
> not begin the tokenstream with posinc=0)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

2018-02-21 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372072#comment-16372072
 ] 

Robert Muir commented on LUCENE-4065:
-

Yeah, you've got it. I really prefer your {{enableGaps}} name. Sorry, the issue 
is just confusing and I was struggling to try to explain it.

Today {{enableGaps}} is always true, which makes deletions pretty simple for 
FilteringTokenFilter. We just have to track an int variable! 

But I think we can potentially support enableGaps=false, and adjust 
positionIncrements/positionLengths so that the result is sane. That's the idea 
of this issue. I think no user _really_ wanted to disable position increments 
entirely before, nobody wants to move synonyms to the incorrect words or 
anything like that. They just want control over whether there are gaps or not: 
it impacts things like phrase queries.

> FilteringTokenFilter should never corrupt the tokenstream graph
> ---
>
> Key: LUCENE-4065
> URL: https://issues.apache.org/jira/browse/LUCENE-4065
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-4065_test.patch
>
>
> Currently removers like stopfilter have an option (true/false) to enable 
> position increments.
> If its true: it both inserts gaps where necessary AND propagates gaps down 
> the stream.
> If its false: it does neither, which can totally mess up the tokenstream 
> graph (e.g. move synonyms to another word).
> There are totally valid natural usecases for false, where you don't want gaps 
> because you want phrasequeries to act as if the word was never actually there.
> But 'not inserting gaps' is separate from proper propagation of existing gaps.
> So I think we should provide an option (either fix 'false' or make it an 
> enum), where you still get a legit tokenstream and dont totally screw it up, 
> but you simply omit gaps.
> See LUCENE-3848 for more information (Where we at least fixed this case to 
> not begin the tokenstream with posinc=0)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

2018-02-21 Thread Steve Rowe (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372046#comment-16372046
 ] 

Steve Rowe commented on LUCENE-4065:


[~rcmuir] commented over on 
[https://issues.apache.org/jira/browse/SOLR-11968?focusedCommentId=16370916=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16370916|SOLR-11968]
 about this issue:

{quote}
I think the issue is still valid, its a little more complex now because of 
positionLength (means more buffering when you see posLength > 1, because you'll 
need to adjust if you remove something in its path), but the idea is the same: 
give the user a choice between "insert mode" and "replace mode". But this new 
"insert mode" would actually work correctly, correcting posLengths before and 
posIncs after as appropriate. similar to how your editor might have to 
recompute some line breaks/word wrapping and so on.

If you have baseball (length=2), base(length=1), ball(length=1), and you delete 
"base" in this case, you need to change baseball's length to 1 before you omit 
it, because you deleted base. Thats the "buffering before" that would be 
required for posLength. And you still need the same buffering described on the 
issue for posInc=0 that might occur after the fact, so you don't wrongly 
transfer synonyms to different words entirely.

It would be slower than "replace mode" that we have today, but only because of 
the buffering, and I think its pretty contained, but I haven't fully thought it 
thru or tried to write any code.
{quote}

 +1, though I find the nomenclature confusing; in your proposed "insert mode", 
token deletions would not leave any trace of the deleted tokens -- in posinc 
and poslen -- right?  (I get that you mean "insert mode" and "replace mode" as 
a metaphoric for editor operations.)  Isn't the issue just whether to leave 
gaps (as indicated by posinc and poslen) where deleted tokens were?

> FilteringTokenFilter should never corrupt the tokenstream graph
> ---
>
> Key: LUCENE-4065
> URL: https://issues.apache.org/jira/browse/LUCENE-4065
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-4065_test.patch
>
>
> Currently removers like stopfilter have an option (true/false) to enable 
> position increments.
> If its true: it both inserts gaps where necessary AND propagates gaps down 
> the stream.
> If its false: it does neither, which can totally mess up the tokenstream 
> graph (e.g. move synonyms to another word).
> There are totally valid natural usecases for false, where you don't want gaps 
> because you want phrasequeries to act as if the word was never actually there.
> But 'not inserting gaps' is separate from proper propagation of existing gaps.
> So I think we should provide an option (either fix 'false' or make it an 
> enum), where you still get a legit tokenstream and dont totally screw it up, 
> but you simply omit gaps.
> See LUCENE-3848 for more information (Where we at least fixed this case to 
> not begin the tokenstream with posinc=0)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

2016-09-02 Thread Eduard Dudar (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15460164#comment-15460164
 ] 

Eduard Dudar commented on LUCENE-4065:
--

enable position increments is incredibly useful for shingles filter in ES in 
particular, ending up with pairs like "_ _" is unpractical. thanks.

> FilteringTokenFilter should never corrupt the tokenstream graph
> ---
>
> Key: LUCENE-4065
> URL: https://issues.apache.org/jira/browse/LUCENE-4065
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Robert Muir
> Attachments: LUCENE-4065_test.patch
>
>
> Currently removers like stopfilter have an option (true/false) to enable 
> position increments.
> If its true: it both inserts gaps where necessary AND propagates gaps down 
> the stream.
> If its false: it does neither, which can totally mess up the tokenstream 
> graph (e.g. move synonyms to another word).
> There are totally valid natural usecases for false, where you don't want gaps 
> because you want phrasequeries to act as if the word was never actually there.
> But 'not inserting gaps' is separate from proper propagation of existing gaps.
> So I think we should provide an option (either fix 'false' or make it an 
> enum), where you still get a legit tokenstream and dont totally screw it up, 
> but you simply omit gaps.
> See LUCENE-3848 for more information (Where we at least fixed this case to 
> not begin the tokenstream with posinc=0)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

2013-01-09 Thread Commit Tag Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13548651#comment-13548651
 ] 

Commit Tag Bot commented on LUCENE-4065:


[trunk commit] Robert Muir
http://svn.apache.org/viewvc?view=revisionrevision=1430939

LUCENE-4065: shitlist these broken ctors so they dont cause false fails


 FilteringTokenFilter should never corrupt the tokenstream graph
 ---

 Key: LUCENE-4065
 URL: https://issues.apache.org/jira/browse/LUCENE-4065
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/analysis
Reporter: Robert Muir
 Attachments: LUCENE-4065_test.patch


 Currently removers like stopfilter have an option (true/false) to enable 
 position increments.
 If its true: it both inserts gaps where necessary AND propagates gaps down 
 the stream.
 If its false: it does neither, which can totally mess up the tokenstream 
 graph (e.g. move synonyms to another word).
 There are totally valid natural usecases for false, where you don't want gaps 
 because you want phrasequeries to act as if the word was never actually there.
 But 'not inserting gaps' is separate from proper propagation of existing gaps.
 So I think we should provide an option (either fix 'false' or make it an 
 enum), where you still get a legit tokenstream and dont totally screw it up, 
 but you simply omit gaps.
 See LUCENE-3848 for more information (Where we at least fixed this case to 
 not begin the tokenstream with posinc=0)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

2013-01-09 Thread Commit Tag Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13548662#comment-13548662
 ] 

Commit Tag Bot commented on LUCENE-4065:


[branch_4x commit] Robert Muir
http://svn.apache.org/viewvc?view=revisionrevision=1430944

LUCENE-4065: shitlist these broken ctors so they dont cause false fails


 FilteringTokenFilter should never corrupt the tokenstream graph
 ---

 Key: LUCENE-4065
 URL: https://issues.apache.org/jira/browse/LUCENE-4065
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/analysis
Reporter: Robert Muir
 Attachments: LUCENE-4065_test.patch


 Currently removers like stopfilter have an option (true/false) to enable 
 position increments.
 If its true: it both inserts gaps where necessary AND propagates gaps down 
 the stream.
 If its false: it does neither, which can totally mess up the tokenstream 
 graph (e.g. move synonyms to another word).
 There are totally valid natural usecases for false, where you don't want gaps 
 because you want phrasequeries to act as if the word was never actually there.
 But 'not inserting gaps' is separate from proper propagation of existing gaps.
 So I think we should provide an option (either fix 'false' or make it an 
 enum), where you still get a legit tokenstream and dont totally screw it up, 
 but you simply omit gaps.
 See LUCENE-3848 for more information (Where we at least fixed this case to 
 not begin the tokenstream with posinc=0)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

2012-05-17 Thread Robert Muir (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13277805#comment-13277805
]

Robert Muir commented on LUCENE-4065:
-

Another way to see it:
imagine i have 'my test case'
and i have a synonyms set with a single mapping: test=example

So synonymfilter makes: 'my test/example case'. Example has posinc=0

if we have a stopfilter with posinc=false that has a single stopword (test),
then we end out with 'my/example case'.

But in my opinion this should be 'my example case': e.g. we should propagate
the posinc=1 of 'test' to example. We arent introducing a gap though, just
preventing
insane graph corruption and restacking of synonyms.

FilteringTokenFilter should never corrupt the tokenstream graph
---

Key: LUCENE-4065
URL: https://issues.apache.org/jira/browse/LUCENE-4065
Project: Lucene - Java
Issue Type: Bug
Components: modules/analysis
Reporter: Robert Muir
Attachments: LUCENE-4065_test.patch

Currently removers like stopfilter have an option (true/false) to enable
position increments.
If its true: it both inserts gaps where necessary AND propagates gaps down
the stream.
If its false: it does neither, which can totally mess up the tokenstream
graph (e.g. move synonyms to another word).
There are totally valid natural usecases for false, where you don't want gaps
because you want phrasequeries to act as if the word was never actually there.
But 'not inserting gaps' is separate from proper propagation of existing gaps.
So I think we should provide an option (either fix 'false' or make it an
enum), where you still get a legit tokenstream and dont totally screw it up,
but you simply omit gaps.
See LUCENE-3848 for more information (Where we at least fixed this case to
not begin the tokenstream with posinc=0)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

11 matches

Site Navigation

Mail list logo

Footer information