[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph
[ https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373774#comment-16373774 ] Steve Rowe commented on LUCENE-4065: bq. But your testcase argues that this should be 4 positions (1 + 2 + 1). I'm just not convinced thats the correct behavior: its unintuitive to me that a stopfilter would make a document "longer" in the sense of actually adding additional positions... (no, it doesn't impact length normalization because this value isn't used for that, but its just really confusing). Yeah I think my testcase is just wrong - StopFilter's behavior actually looks correct here to me (testcase can be fixed by changing the "walking" posinc {{2}} -> {{1}}, which no longer fails). I guess I misinterpreted Jim's example as graph corruption, which this doesn't look like to me. > FilteringTokenFilter should never corrupt the tokenstream graph > --- > > Key: LUCENE-4065 > URL: https://issues.apache.org/jira/browse/LUCENE-4065 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Robert Muir >Priority: Major > Attachments: LUCENE-4065_test.patch > > > Currently removers like stopfilter have an option (true/false) to enable > position increments. > If its true: it both inserts gaps where necessary AND propagates gaps down > the stream. > If its false: it does neither, which can totally mess up the tokenstream > graph (e.g. move synonyms to another word). > There are totally valid natural usecases for false, where you don't want gaps > because you want phrasequeries to act as if the word was never actually there. > But 'not inserting gaps' is separate from proper propagation of existing gaps. > So I think we should provide an option (either fix 'false' or make it an > enum), where you still get a legit tokenstream and dont totally screw it up, > but you simply omit gaps. > See LUCENE-3848 for more information (Where we at least fixed this case to > not begin the tokenstream with posinc=0) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph
[ https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373757#comment-16373757 ] Robert Muir commented on LUCENE-4065: - Yeah, i mean we should split it up. Its probably more important to figure out what this thing should be doing in the graph case (positionLengths). Because before the stopfilter i've got a total of 3 positions (1 + 0 + 1 + 1). Today the stopfilter deletes "the" and transfers the position to "twd", so i've still got 3 positions (1 + 1 + 1). But your testcase argues that this should be 4 positions (1 + 2 + 1). I'm just not convinced thats the correct behavior: its unintuitive to me that a stopfilter would make a document "longer" in the sense of actually adding additional positions... (no, it doesn't impact length normalization because this value isn't used for that, but its just really confusing). > FilteringTokenFilter should never corrupt the tokenstream graph > --- > > Key: LUCENE-4065 > URL: https://issues.apache.org/jira/browse/LUCENE-4065 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Robert Muir >Priority: Major > Attachments: LUCENE-4065_test.patch > > > Currently removers like stopfilter have an option (true/false) to enable > position increments. > If its true: it both inserts gaps where necessary AND propagates gaps down > the stream. > If its false: it does neither, which can totally mess up the tokenstream > graph (e.g. move synonyms to another word). > There are totally valid natural usecases for false, where you don't want gaps > because you want phrasequeries to act as if the word was never actually there. > But 'not inserting gaps' is separate from proper propagation of existing gaps. > So I think we should provide an option (either fix 'false' or make it an > enum), where you still get a legit tokenstream and dont totally screw it up, > but you simply omit gaps. > See LUCENE-3848 for more information (Where we at least fixed this case to > not begin the tokenstream with posinc=0) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph
[ https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373746#comment-16373746 ] Steve Rowe commented on LUCENE-4065: bq. i think its totally separate from the whole idea of giving the user a gaps option. Maybe change the issue title to that then? (I was lumping it here because it's a form of token stream corruption) > FilteringTokenFilter should never corrupt the tokenstream graph > --- > > Key: LUCENE-4065 > URL: https://issues.apache.org/jira/browse/LUCENE-4065 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Robert Muir >Priority: Major > Attachments: LUCENE-4065_test.patch > > > Currently removers like stopfilter have an option (true/false) to enable > position increments. > If its true: it both inserts gaps where necessary AND propagates gaps down > the stream. > If its false: it does neither, which can totally mess up the tokenstream > graph (e.g. move synonyms to another word). > There are totally valid natural usecases for false, where you don't want gaps > because you want phrasequeries to act as if the word was never actually there. > But 'not inserting gaps' is separate from proper propagation of existing gaps. > So I think we should provide an option (either fix 'false' or make it an > enum), where you still get a legit tokenstream and dont totally screw it up, > but you simply omit gaps. > See LUCENE-3848 for more information (Where we at least fixed this case to > not begin the tokenstream with posinc=0) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph
[ https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373727#comment-16373727 ] Robert Muir commented on LUCENE-4065: - Well, i think thats a separate, "new" issue, related to positionLength. Keep in mind this JIRA issue was open before positionLength even existed at all. i think its totally separate from the whole idea of giving the user a gaps option. This is what the thing looks like before stopfilter sees it. When you delete "the", it just transfers the 1 to twd. It doesn't currently look at positionlength at all. {noformat} SynonymGraphFilter->term=the,positionIncrement=1,positionLength=1,type=SYNONYM,termFrequency=1 SynonymGraphFilter->term=twd,positionIncrement=0,positionLength=3,type=word,termFrequency=1 SynonymGraphFilter->term=walking,positionIncrement=1,positionLength=1,type=SYNONYM,termFrequency=1 SynonymGraphFilter->term=dead,positionIncrement=1,positionLength=1,type=SYNONYM,termFrequency=1 {noformat} To be honest, its unclear if stopfilter is really the culprit. Its definitely funky the way that SynonymGraphFilter makes the original word "twd" a "synonym" (posInc=0)... i think if it didn't do that, you wouldn't have that problem in this case. But i don't know if its a general solution to your problem. > FilteringTokenFilter should never corrupt the tokenstream graph > --- > > Key: LUCENE-4065 > URL: https://issues.apache.org/jira/browse/LUCENE-4065 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Robert Muir >Priority: Major > Attachments: LUCENE-4065_test.patch > > > Currently removers like stopfilter have an option (true/false) to enable > position increments. > If its true: it both inserts gaps where necessary AND propagates gaps down > the stream. > If its false: it does neither, which can totally mess up the tokenstream > graph (e.g. move synonyms to another word). > There are totally valid natural usecases for false, where you don't want gaps > because you want phrasequeries to act as if the word was never actually there. > But 'not inserting gaps' is separate from proper propagation of existing gaps. > So I think we should provide an option (either fix 'false' or make it an > enum), where you still get a legit tokenstream and dont totally screw it up, > but you simply omit gaps. > See LUCENE-3848 for more information (Where we at least fixed this case to > not begin the tokenstream with posinc=0) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph
[ https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373703#comment-16373703 ] Steve Rowe commented on LUCENE-4065: Based on [Jim Ferenczi's comment on SOLR-11968|https://issues.apache.org/jira/browse/SOLR-11968?focusedCommentId=16373554=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16373554], I created a failing test for StopFilter that shows that StopFilter can (still) corrupt the token stream - the failure message says that "walking" gets a posinc of 1 instead of 2, which means that the only way to interpret the "twd" token's poslen of 3 is as a trailing gap, which is misplaced: {code:java|title=TestStopFilterFactory.java} public void testLeadingStopwordSynonymGraph() throws Exception { SynonymMap.Builder builder = new SynonymMap.Builder(true); builder.add(new CharsRef("twd"), new CharsRef("the\uwalking\udead"), true); final SynonymMap synonymMap = builder.build(); Analyzer analyzer = new Analyzer() { @Override protected TokenStreamComponents createComponents(String fieldName) { MockTokenizer tokenizer = new MockTokenizer(); TokenStream stream = new SynonymGraphFilter(tokenizer, synonymMap, true); stream = new StopFilter(stream, CharArraySet.copy(Collections.singleton("the"))); return new TokenStreamComponents(tokenizer, stream); } }; TokenStream tokenStream = analyzer.tokenStream("field", "twd"); assertTokenStreamContents(tokenStream, new String[] { "twd", "walking", "dead" }, null, null, new int[]{ 1, 2, 1 }, // posinc new int[]{ 3, 1, 1 }, // poslen null); } {code} > FilteringTokenFilter should never corrupt the tokenstream graph > --- > > Key: LUCENE-4065 > URL: https://issues.apache.org/jira/browse/LUCENE-4065 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Robert Muir >Priority: Major > Attachments: LUCENE-4065_test.patch > > > Currently removers like stopfilter have an option (true/false) to enable > position increments. > If its true: it both inserts gaps where necessary AND propagates gaps down > the stream. > If its false: it does neither, which can totally mess up the tokenstream > graph (e.g. move synonyms to another word). > There are totally valid natural usecases for false, where you don't want gaps > because you want phrasequeries to act as if the word was never actually there. > But 'not inserting gaps' is separate from proper propagation of existing gaps. > So I think we should provide an option (either fix 'false' or make it an > enum), where you still get a legit tokenstream and dont totally screw it up, > but you simply omit gaps. > See LUCENE-3848 for more information (Where we at least fixed this case to > not begin the tokenstream with posinc=0) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph
[ https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372072#comment-16372072 ] Robert Muir commented on LUCENE-4065: - Yeah, you've got it. I really prefer your {{enableGaps}} name. Sorry, the issue is just confusing and I was struggling to try to explain it. Today {{enableGaps}} is always true, which makes deletions pretty simple for FilteringTokenFilter. We just have to track an int variable! But I think we can potentially support enableGaps=false, and adjust positionIncrements/positionLengths so that the result is sane. That's the idea of this issue. I think no user _really_ wanted to disable position increments entirely before, nobody wants to move synonyms to the incorrect words or anything like that. They just want control over whether there are gaps or not: it impacts things like phrase queries. > FilteringTokenFilter should never corrupt the tokenstream graph > --- > > Key: LUCENE-4065 > URL: https://issues.apache.org/jira/browse/LUCENE-4065 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Robert Muir >Priority: Major > Attachments: LUCENE-4065_test.patch > > > Currently removers like stopfilter have an option (true/false) to enable > position increments. > If its true: it both inserts gaps where necessary AND propagates gaps down > the stream. > If its false: it does neither, which can totally mess up the tokenstream > graph (e.g. move synonyms to another word). > There are totally valid natural usecases for false, where you don't want gaps > because you want phrasequeries to act as if the word was never actually there. > But 'not inserting gaps' is separate from proper propagation of existing gaps. > So I think we should provide an option (either fix 'false' or make it an > enum), where you still get a legit tokenstream and dont totally screw it up, > but you simply omit gaps. > See LUCENE-3848 for more information (Where we at least fixed this case to > not begin the tokenstream with posinc=0) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph
[ https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372046#comment-16372046 ] Steve Rowe commented on LUCENE-4065: [~rcmuir] commented over on [https://issues.apache.org/jira/browse/SOLR-11968?focusedCommentId=16370916=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16370916|SOLR-11968] about this issue: {quote} I think the issue is still valid, its a little more complex now because of positionLength (means more buffering when you see posLength > 1, because you'll need to adjust if you remove something in its path), but the idea is the same: give the user a choice between "insert mode" and "replace mode". But this new "insert mode" would actually work correctly, correcting posLengths before and posIncs after as appropriate. similar to how your editor might have to recompute some line breaks/word wrapping and so on. If you have baseball (length=2), base(length=1), ball(length=1), and you delete "base" in this case, you need to change baseball's length to 1 before you omit it, because you deleted base. Thats the "buffering before" that would be required for posLength. And you still need the same buffering described on the issue for posInc=0 that might occur after the fact, so you don't wrongly transfer synonyms to different words entirely. It would be slower than "replace mode" that we have today, but only because of the buffering, and I think its pretty contained, but I haven't fully thought it thru or tried to write any code. {quote} +1, though I find the nomenclature confusing; in your proposed "insert mode", token deletions would not leave any trace of the deleted tokens -- in posinc and poslen -- right? (I get that you mean "insert mode" and "replace mode" as a metaphoric for editor operations.) Isn't the issue just whether to leave gaps (as indicated by posinc and poslen) where deleted tokens were? > FilteringTokenFilter should never corrupt the tokenstream graph > --- > > Key: LUCENE-4065 > URL: https://issues.apache.org/jira/browse/LUCENE-4065 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Robert Muir >Priority: Major > Attachments: LUCENE-4065_test.patch > > > Currently removers like stopfilter have an option (true/false) to enable > position increments. > If its true: it both inserts gaps where necessary AND propagates gaps down > the stream. > If its false: it does neither, which can totally mess up the tokenstream > graph (e.g. move synonyms to another word). > There are totally valid natural usecases for false, where you don't want gaps > because you want phrasequeries to act as if the word was never actually there. > But 'not inserting gaps' is separate from proper propagation of existing gaps. > So I think we should provide an option (either fix 'false' or make it an > enum), where you still get a legit tokenstream and dont totally screw it up, > but you simply omit gaps. > See LUCENE-3848 for more information (Where we at least fixed this case to > not begin the tokenstream with posinc=0) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph
[ https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15460164#comment-15460164 ] Eduard Dudar commented on LUCENE-4065: -- enable position increments is incredibly useful for shingles filter in ES in particular, ending up with pairs like "_ _" is unpractical. thanks. > FilteringTokenFilter should never corrupt the tokenstream graph > --- > > Key: LUCENE-4065 > URL: https://issues.apache.org/jira/browse/LUCENE-4065 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Robert Muir > Attachments: LUCENE-4065_test.patch > > > Currently removers like stopfilter have an option (true/false) to enable > position increments. > If its true: it both inserts gaps where necessary AND propagates gaps down > the stream. > If its false: it does neither, which can totally mess up the tokenstream > graph (e.g. move synonyms to another word). > There are totally valid natural usecases for false, where you don't want gaps > because you want phrasequeries to act as if the word was never actually there. > But 'not inserting gaps' is separate from proper propagation of existing gaps. > So I think we should provide an option (either fix 'false' or make it an > enum), where you still get a legit tokenstream and dont totally screw it up, > but you simply omit gaps. > See LUCENE-3848 for more information (Where we at least fixed this case to > not begin the tokenstream with posinc=0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph
[ https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13548651#comment-13548651 ] Commit Tag Bot commented on LUCENE-4065: [trunk commit] Robert Muir http://svn.apache.org/viewvc?view=revisionrevision=1430939 LUCENE-4065: shitlist these broken ctors so they dont cause false fails FilteringTokenFilter should never corrupt the tokenstream graph --- Key: LUCENE-4065 URL: https://issues.apache.org/jira/browse/LUCENE-4065 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Reporter: Robert Muir Attachments: LUCENE-4065_test.patch Currently removers like stopfilter have an option (true/false) to enable position increments. If its true: it both inserts gaps where necessary AND propagates gaps down the stream. If its false: it does neither, which can totally mess up the tokenstream graph (e.g. move synonyms to another word). There are totally valid natural usecases for false, where you don't want gaps because you want phrasequeries to act as if the word was never actually there. But 'not inserting gaps' is separate from proper propagation of existing gaps. So I think we should provide an option (either fix 'false' or make it an enum), where you still get a legit tokenstream and dont totally screw it up, but you simply omit gaps. See LUCENE-3848 for more information (Where we at least fixed this case to not begin the tokenstream with posinc=0) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph
[ https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13548662#comment-13548662 ] Commit Tag Bot commented on LUCENE-4065: [branch_4x commit] Robert Muir http://svn.apache.org/viewvc?view=revisionrevision=1430944 LUCENE-4065: shitlist these broken ctors so they dont cause false fails FilteringTokenFilter should never corrupt the tokenstream graph --- Key: LUCENE-4065 URL: https://issues.apache.org/jira/browse/LUCENE-4065 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Reporter: Robert Muir Attachments: LUCENE-4065_test.patch Currently removers like stopfilter have an option (true/false) to enable position increments. If its true: it both inserts gaps where necessary AND propagates gaps down the stream. If its false: it does neither, which can totally mess up the tokenstream graph (e.g. move synonyms to another word). There are totally valid natural usecases for false, where you don't want gaps because you want phrasequeries to act as if the word was never actually there. But 'not inserting gaps' is separate from proper propagation of existing gaps. So I think we should provide an option (either fix 'false' or make it an enum), where you still get a legit tokenstream and dont totally screw it up, but you simply omit gaps. See LUCENE-3848 for more information (Where we at least fixed this case to not begin the tokenstream with posinc=0) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph
[ https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13277805#comment-13277805 ] Robert Muir commented on LUCENE-4065: - Another way to see it: imagine i have 'my test case' and i have a synonyms set with a single mapping: test=example So synonymfilter makes: 'my test/example case'. Example has posinc=0 if we have a stopfilter with posinc=false that has a single stopword (test), then we end out with 'my/example case'. But in my opinion this should be 'my example case': e.g. we should propagate the posinc=1 of 'test' to example. We arent introducing a gap though, just preventing insane graph corruption and restacking of synonyms. FilteringTokenFilter should never corrupt the tokenstream graph --- Key: LUCENE-4065 URL: https://issues.apache.org/jira/browse/LUCENE-4065 Project: Lucene - Java Issue Type: Bug Components: modules/analysis Reporter: Robert Muir Attachments: LUCENE-4065_test.patch Currently removers like stopfilter have an option (true/false) to enable position increments. If its true: it both inserts gaps where necessary AND propagates gaps down the stream. If its false: it does neither, which can totally mess up the tokenstream graph (e.g. move synonyms to another word). There are totally valid natural usecases for false, where you don't want gaps because you want phrasequeries to act as if the word was never actually there. But 'not inserting gaps' is separate from proper propagation of existing gaps. So I think we should provide an option (either fix 'false' or make it an enum), where you still get a legit tokenstream and dont totally screw it up, but you simply omit gaps. See LUCENE-3848 for more information (Where we at least fixed this case to not begin the tokenstream with posinc=0) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org