[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16465152#comment-16465152 ] Robert Muir commented on LUCENE-7960: Again I want to re-emphasize that anything more complex than a single boolean "preserveOriginal" is too much. If someone wants to remove too-short or too-long terms they can use LengthFilter for that. There is no need to have such complex stuff in the ngram filters themselves. Furthermore I still think we need to address the traps I mentioned above about these filters emitting too many tokens already before we then go and add an option to make them produce even more... > NGram filters -- add option to keep short terms > --- > > Key: LUCENE-7960 > URL: https://issues.apache.org/jira/browse/LUCENE-7960 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Shawn Heisey >Priority: Major > Attachments: LUCENE-7960.patch, LUCENE-7960.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > When ngram or edgengram filters are used, any terms that are shorter than the > minGramSize are completely removed from the token stream. > This is probably 100% what was intended, but I've seen it cause a lot of > problems for users. I am not suggesting that the default behavior be > changed. That would be far too disruptive to the existing user base. > I do think there should be a new boolean option, with a name like > keepShortTerms, that defaults to false, to allow the short terms to be > preserved. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
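The behavior described in the issue text above, and the single "preserveOriginal" flag argued for in this comment, can be illustrated with a minimal plain-Java sketch. It does not use Lucene at all: the class EdgeNGramSketch and the method edgeNGrams are hypothetical names, and the rule "emit the original term only when it falls outside the minGram..maxGram range" is an assumption about how such a flag might behave, not a statement about the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an edge n-gram filter; NOT Lucene's implementation.
class EdgeNGramSketch {

    /**
     * Returns the prefixes of term whose length lies in [minGram, maxGram].
     * If preserveOriginal is true and the term is shorter than minGram or
     * longer than maxGram, the unmodified term is emitted as well (assumed rule).
     */
    static List<String> edgeNGrams(String term, int minGram, int maxGram,
                                   boolean preserveOriginal) {
        List<String> out = new ArrayList<>();
        int longest = Math.min(maxGram, term.length());
        for (int size = minGram; size <= longest; size++) {
            out.add(term.substring(0, size));
        }
        boolean outsideRange = term.length() < minGram || term.length() > maxGram;
        if (preserveOriginal && outsideRange) {
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        // "ab" is shorter than minGram=3, so without the flag it disappears.
        System.out.println(edgeNGrams("ab", 3, 5, false));    // []
        System.out.println(edgeNGrams("ab", 3, 5, true));     // [ab]
        System.out.println(edgeNGrams("lucene", 2, 3, true)); // [lu, luc, lucene]
    }
}
```

With the flag off, a two-character query term against minGram=3 yields no tokens at all, which is the user-facing problem the issue describes.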
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463930#comment-16463930 ] Shawn Heisey commented on LUCENE-7960: At first blush, an enum seems even more of a mess than one or two extra boolean parameters. I will let [~rcmuir] and others with more experience in these matters make the call on that. How would the factory (and by extension, text configuration files) handle it? If two boolean parameters are going to meet with resistance, I can support preserveOriginal. I think users will want long and short handled separately, but one flag would get the job done.
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461976#comment-16461976 ] Ingomar Wesp commented on LUCENE-7960: I understand your concern. As far as I can tell, there are two options:
1) Replace the booleans with an enum that covers the four possible combinations. Maybe "keepMode" with values "DROP", "KEEP_SHORT_TERM", "KEEP_LONG_TERM", "KEEP_ALL". I'm not really happy with the names, though - advice welcome.
2) Fold the two booleans into one - maybe "preserveOriginal", matching the name of the corresponding attribute in other filters.
I personally prefer 1), but I'd happily adapt the patch to implement the other if it makes things easier from your perspective.
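Option 1) above could look roughly like the following sketch. The constant names are the ones floated in the comment; everything else (the enum name KeepMode, the fields, and the helpers fromFlags and parse) is hypothetical and only meant to show how four combinations of two booleans map onto one value, and how a string from a text configuration file might be turned into that value.

```java
import java.util.Locale;

// Hypothetical enum covering the four combinations of "keep short" and "keep long".
enum KeepMode {
    DROP(false, false),
    KEEP_SHORT_TERM(true, false),
    KEEP_LONG_TERM(false, true),
    KEEP_ALL(true, true);

    final boolean keepShort;
    final boolean keepLong;

    KeepMode(boolean keepShort, boolean keepLong) {
        this.keepShort = keepShort;
        this.keepLong = keepLong;
    }

    /** Maps the two booleans of the other design onto a single enum value. */
    static KeepMode fromFlags(boolean keepShort, boolean keepLong) {
        if (keepShort) {
            return keepLong ? KEEP_ALL : KEEP_SHORT_TERM;
        }
        return keepLong ? KEEP_LONG_TERM : DROP;
    }

    /** Parses a configuration string such as "keep_all" (case-insensitive). */
    static KeepMode parse(String value) {
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }
}
```

A factory reading a configuration attribute would call something like KeepMode.parse on the raw string; an unknown value makes valueOf throw IllegalArgumentException, so misspelled settings fail loudly instead of silently falling back to DROP.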
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461792#comment-16461792 ] Shawn Heisey commented on LUCENE-7960: That idea had nothing to do with the number of booleans. Only with making any extra arguments (no matter how many there are) optional.
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461782#comment-16461782 ] Robert Muir commented on LUCENE-7960: Sorry, varargs are completely uncalled for here. Arguing for 250 booleans instead of just 1 boolean isn't going to work as a "negotiating" strategy to get back to 2. Please take my recommendations seriously.
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461777#comment-16461777 ] Shawn Heisey commented on LUCENE-7960: The one thing that I do not know is whether an added argument with the ellipsis notation preserves API compatibility. If we did that, would a program originally compiled against an older Lucene version still work correctly with the added parameter? I know that everything would be fine if the program were re-compiled. That, I think, technically meets our overall goal for a minor release, but preserving binary compatibility when possible is a good bonus.
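On the binary-compatibility question: in Java, a trailing "boolean..." parameter is compiled as a "boolean[]" parameter, so the constructor's descriptor changes. Existing source still compiles, but a caller compiled against the old three-argument constructor would be expected to fail at link time with NoSuchMethodError unless the old constructor is kept alongside. The reflection sketch below makes the descriptor change visible; VarargsCtorSketch is a hypothetical class, with String standing in for TokenStream so the example needs no Lucene dependency.

```java
import java.lang.reflect.Constructor;

// Hypothetical class; String stands in for TokenStream to avoid a Lucene dependency.
class VarargsCtorSketch {
    final int minGram;
    final int maxGram;
    final boolean[] flags;

    VarargsCtorSketch(String input, int minGram, int maxGram, boolean... flags) {
        this.minGram = minGram;
        this.maxGram = maxGram;
        this.flags = flags; // empty array when the caller omits the flags
    }

    /** True only if a real (String, int, int) constructor exists in the class file. */
    static boolean hasThreeArgConstructor() {
        try {
            VarargsCtorSketch.class.getDeclaredConstructor(String.class, int.class, int.class);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    /** True if the (String, int, int, boolean[]) constructor exists and is varargs. */
    static boolean hasVarargsConstructor() {
        try {
            Constructor<VarargsCtorSketch> c = VarargsCtorSketch.class.getDeclaredConstructor(
                    String.class, int.class, int.class, boolean[].class);
            return c.isVarArgs();
        } catch (NoSuchMethodException e) {
            return false;
        }
    }
}
```

Source code that calls the constructor with three arguments still compiles, because the compiler inserts an empty boolean array; that is source compatibility only. Keeping the original three-argument constructor and adding a separate overload is the usual way to preserve both.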
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461767#comment-16461767 ] Shawn Heisey commented on LUCENE-7960: An example of where I used the ellipsis notation in my own code to make a boolean argument optional:
{code:java}
/**
 * Fully close a connection, statement, and result set, ignoring any errors
 * that occur. Any of the three resources here can be null, but at least one
 * of them must NOT be null.
 *
 * @param rs the ResultSet to close.
 * @param st the Statement to close.
 * @param conn The Connection to close.
 * @param forceFlags This odd ellipsis parameter is used for one thing
 *            currently: a flag to indicate whether or not a close will be
 *            forced on all provided resources even if everything doesn't
 *            match up. In theory, a statement derived from the resultset
 *            and connections derived from either one should be exactly the
 *            same object as the ones provided to the method. If the flag is
 *            false, then only the first non-null resource provided and any
 *            parent resources derived from that resource will be closed. If
 *            it is true, ALL resources including derived resources will be
 *            closed. Mismatches will be logged either way. The ellipsis
 *            notation is so that this parameter is optional. If omitted, it
 *            will default to false.
 * @throws IllegalArgumentException if all three resources are null.
 */
public static void fullQuietClose(ResultSet rs, Statement st, Connection conn, boolean... forceFlags)
{code}
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461757#comment-16461757 ] Shawn Heisey commented on LUCENE-7960: I just thought of a particularly ugly idea that would preserve the current 3-arg capability *and* allow the extra booleans. Make the constructor signature this:
{code}
public EdgeNGramTokenFilter(
    TokenStream input, int minGram, int maxGram, boolean... flags) {
{code}
I sometimes do things like this in my own code with methods that nobody else is going to use. But for a public API like Lucene, is that as bad an idea as it seems?
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461700#comment-16461700 ] Shawn Heisey commented on LUCENE-7960: The "obvious" workaround to either situation is to decrease minGram and/or increase maxGram. I find that increasing maxGram doesn't meet with a lot of resistance ... but decreasing minGram can lead to massive term explosion (with possible performance ramifications) and a big shift in recall/precision balance.
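The term explosion mentioned above follows from simple counting: a term of length L contains L - n + 1 substrings of size n, so a plain (non-edge) ngram filter emits the sum of that quantity over every size from minGram to maxGram. The sketch below is only that arithmetic; GramCountSketch and nGramCount are hypothetical names, and real token counts also depend on the rest of the analysis chain.

```java
// Hypothetical helper: counts the grams one term produces; not part of Lucene.
class GramCountSketch {

    /** Number of n-grams of every size in [minGram, maxGram] for a single term. */
    static int nGramCount(int termLength, int minGram, int maxGram) {
        int total = 0;
        for (int size = minGram; size <= maxGram; size++) {
            // A term of length L contains L - size + 1 substrings of that size
            // (none if the term is shorter than the gram size).
            total += Math.max(0, termLength - size + 1);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(nGramCount(10, 3, 5)); // 21
        System.out.println(nGramCount(10, 1, 5)); // 40
        System.out.println(nGramCount(2, 3, 5));  // 0
    }
}
```

For a ten-character term, lowering minGram from 3 to 1 roughly doubles the output (21 grams to 40), and the added one- and two-character grams match a very large share of the index, which is where the recall/precision shift comes from.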
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461695#comment-16461695 ] Shawn Heisey commented on LUCENE-7960: My original idea would have been handled by one boolean -- keeping terms shorter than minGram. On more than one occasion, I've fielded questions where it turns out the user is trying to search for terms shorter than their minGram size. In discussing it, the notion of *long* terms being removed by the min/max range also came up. It was an idea I had not originally considered, but I have since encountered someone who had ngram on the index side but not the query side, and wanted to search for terms longer than their maxGram size. It could be reduced to one "keep" boolean to keep both short and long terms, but I think we're going to have people who want to keep short terms but not long terms, and vice versa.
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461683#comment-16461683 ] Robert Muir commented on LUCENE-7960: My biggest concern is that these filters would then have two ctors:
* NGramTokenFilter(TokenStream)
* NGramTokenFilter(TokenStream, int, int, boolean, boolean)
The one without size arguments starts looking more attractive to users at this point, and it's mega-trappy (n=1,2)!!! That's the ctor that should be deprecated :) In general I'll be honest, I don't like how trappy the apis are with these filters/tokenizers because of defaults like that. I also think it's trappy they take a min and a max at all, because that's really creating (max-min) indexed fields all unioned into one. There aren't even any warnings about this. I haven't reviewed what the booleans of the patch do, but I am concerned that the use case may just be "keep original" which could be one boolean, or perhaps done in a different way entirely (e.g. KeywordRepeatFilter or perhaps something like LUCENE-8273). So if it's acceptable to collapse it into one boolean that does that, I think that would be easier. I feel like any defaults that our apis lead to (and when you have multiple ctors, then that's a default) should be something that will perform and scale well and work for the general case. For example n=4 has been shown to work well in many relevance experiments. At least we should make it easy for you to explicitly ask for something like that without passing many parameters.
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461673#comment-16461673 ] Shawn Heisey commented on LUCENE-7960: Updated patch. Does not deprecate constructors, does not fiddle with constructor usage in non-test code.
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461660#comment-16461660 ] Shawn Heisey commented on LUCENE-7960: [~rcmuir] so you would keep the current constructor around permanently? I have no objection if that's what you'd prefer. [~iwesp] You did preserve it. When I looked at the patch (not the result of applying the patch) it looked like a replacement, which prompted that comment.
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461514#comment-16461514 ] Robert Muir commented on LUCENE-7960: The patch doesn't add up to me. The description of this issue claims that the default behavior wouldn't be changed, but then the patch does just the opposite and makes the new parameters mandatory. 5 arguments is too many here; that's not usable IMO.
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461325#comment-16461325 ] Ingomar Wesp commented on LUCENE-7960: Thanks a lot for your support. I don't quite understand your comment regarding the constructors: Unless I'm missing something, I think I _did_ preserve the original ones, which now delegate to the new ctors using default values. Is there anything left that I can or should do to get this into master?
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460312#comment-16460312 ] Shawn Heisey commented on LUCENE-7960: Updated patch added. Deprecates the existing 3-arg constructors, and removes all usage of the deprecated constructors from the codebase. Tests in lucene/analysis and precommit at the root are passing.
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460304#comment-16460304 ] Shawn Heisey commented on LUCENE-7960: Applying the PR as-is does seem to work. All the tests are passing. I'm working on some minor alterations. I've got precommit running, so far it looks good.
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460226#comment-16460226 ] Shawn Heisey commented on LUCENE-7960: I have basically come to the conclusion that I have absolutely no idea how this stuff works and cannot make any sense out of what the patch does, or even what the classes are doing *before* the modifications.
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459873#comment-16459873 ] Shawn Heisey commented on LUCENE-7960: -- I've gotten a look at the PR. Changing the signature on an existing constructor isn't a good idea. Lucene is a public API and there will be user code using that constructor that must continue to work if Lucene is upgraded. We should add a new constructor and have the existing constructor(s) call that one with default values. The only question about that is whether the existing constructor should be deprecated in stable and removed in master. I'm not sure who to ask. There are some variable renames. They don't look like problems, especially because the visibility is private, but I'd like to get the opinion of someone who has deeper Lucene knowledge. I'm having a difficult time following the modifications to the filter logic. Some of the modifications look like they're not directly related to implementing this issue, but I can't tell for sure. > NGram filters -- add option to keep short terms > --- > > Key: LUCENE-7960 > URL: https://issues.apache.org/jira/browse/LUCENE-7960 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Shawn Heisey >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > When ngram or edgengram filters are used, any terms that are shorter than the > minGramSize are completely removed from the token stream. > This is probably 100% what was intended, but I've seen it cause a lot of > problems for users. I am not suggesting that the default behavior be > changed. That would be far too disruptive to the existing user base. > I do think there should be a new boolean option, with a name like > keepShortTerms, that defaults to false, to allow the short terms to be > preserved. 
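The backward-compatible approach suggested in the comment above (add a new constructor and have the existing one delegate to it with default values) might look roughly like the following. This is only a sketch: GramFilterSketch, its fields, and the keepShortTerm flag are hypothetical stand-ins and are not the actual Lucene classes or the code in the PR.

```java
// Hypothetical stand-in for an ngram filter class. It does not extend
// Lucene's TokenFilter; it only illustrates constructor delegation.
final class GramFilterSketch {
    private final int minGram;
    private final int maxGram;
    private final boolean keepShortTerm; // assumed option name, not a real Lucene field

    // Existing signature, left untouched so that user code written
    // against it keeps compiling and behaves as before.
    GramFilterSketch(int minGram, int maxGram) {
        this(minGram, maxGram, false); // default: short terms are dropped
    }

    // New signature carrying the extra option.
    GramFilterSketch(int minGram, int maxGram, boolean keepShortTerm) {
        if (minGram < 1) {
            throw new IllegalArgumentException("minGram must be greater than zero");
        }
        if (minGram > maxGram) {
            throw new IllegalArgumentException("minGram must not be greater than maxGram");
        }
        this.minGram = minGram;
        this.maxGram = maxGram;
        this.keepShortTerm = keepShortTerm;
    }

    int minGram() { return minGram; }
    int maxGram() { return maxGram; }
    boolean keepsShortTerm() { return keepShortTerm; }
}
```

Whether the two-argument constructor should additionally be marked @Deprecated is the open question raised in the comment; the sketch leaves it undecided.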
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453005#comment-16453005 ] Ingomar Wesp commented on LUCENE-7960: -- I've just updated the patch in PR #362. I now also have a working patch for NGramTokenizer and EdgeNGramTokenizer in a separate branch, but it's still pretty messy and thus not yet ready for a PR. Since this issue deals with the filters specifically, could someone have a look at PR #362 and merge it if it's acceptable? Once this is done, I would then open another issue for the tokenizers.
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16426221#comment-16426221 ] Ingomar Wesp commented on LUCENE-7960: -- Ok, I just added the same parameters to the NGramTokenFilter and updated the pull request. In the long term, it probably makes sense to move all the logic into the NGramTokenFilter and turn EdgeNGramTokenFilter into a simple wrapper. EdgeNGramTokenizer is already implemented this way. I presume it also makes sense to extend NGramTokenizer and EdgeNGramTokenizer accordingly?
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421621#comment-16421621 ] Ingomar Wesp commented on LUCENE-7960: -- Thanks for your feedback! Yes, this is exactly the behavior I tried to implement. I'll rename the parameters according to your suggestion. I will also have a look at the other ngram components as soon as I have time.
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421479#comment-16421479 ] Shawn Heisey commented on LUCENE-7960: -- When I created this issue, I didn't think about long terms. Somebody probably needs the functionality. On further reflection, I don't think that the new parameter names should be plural. Using "keepShortTerm" and "keepLongTerm" sounds better to me. They could both be enabled if that's what the user wants. The same options should be added to all ngram analysis components, not just EdgeNGramTokenFilter. Let's consider a min length of 4 and a max length of 6, using EdgeNGramTokenFilter. If the input term is "abcdefgh", here's the basic term list from the filter: "abcd", "abcde", "abcdef". If keepShortTerm is enabled with the longer input, there is no change. If keepLongTerm is enabled with the longer input, then the term list will be: "abcd", "abcde", "abcdef", "abcdefgh". The seven-character string "abcdefg" would not be created. If that's what the user wants, they should just increase the max value, rather than enable the new option. If the input term is "ab", then the filter would not normally produce any terms. With keepShortTerm, the output would be just the input, "ab". No other terms (such as the one-character gram "a") would be produced. If the user wants those, they would need to reduce the min value. The keepLongTerm option would have no effect with a short input. I did glance at the patch, but didn't examine it in detail, so I don't know if it does what I just described or not.
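The behavior described in the comment above can be written down as a small plain-Java sketch. It deliberately does not use Lucene's TokenFilter API; EdgeGramSketch and edgeGrams are hypothetical names, and keepShortTerm / keepLongTerm are the options proposed in this issue, not existing Lucene parameters. For simplicity it counts Java chars rather than code points.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of the proposed semantics; not Lucene code.
final class EdgeGramSketch {

    // Returns the front-edge grams of `term` with lengths minGram..maxGram.
    // keepShortTerm: pass through a term shorter than minGram unchanged.
    // keepLongTerm:  additionally emit a term longer than maxGram unchanged.
    static List<String> edgeGrams(String term, int minGram, int maxGram,
                                  boolean keepShortTerm, boolean keepLongTerm) {
        List<String> out = new ArrayList<>();
        int len = term.length(); // simplification: chars, not code points
        if (len < minGram) {
            // Too short for any gram: emit the original only if asked to.
            if (keepShortTerm) {
                out.add(term);
            }
            return out;
        }
        // Regular edge grams, capped at the term length.
        for (int n = minGram; n <= Math.min(maxGram, len); n++) {
            out.add(term.substring(0, n));
        }
        // The original term is appended only when it exceeds maxGram.
        if (len > maxGram && keepLongTerm) {
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(edgeGrams("abcdefgh", 4, 6, false, false)); // [abcd, abcde, abcdef]
        System.out.println(edgeGrams("abcdefgh", 4, 6, false, true));  // [abcd, abcde, abcdef, abcdefgh]
        System.out.println(edgeGrams("ab", 4, 6, true, false));        // [ab]
        System.out.println(edgeGrams("ab", 4, 6, false, true));        // []
    }
}
```

Note that a term whose length falls between the min and max values is unaffected by either flag, since it already appears as its own longest gram.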
[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
[ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421475#comment-16421475 ] Ingomar Wesp commented on LUCENE-7960: -- I'd like to propose a patch (see attached pull request #349) that adds two options to the EdgeNGramTokenFilter: * keepShortTerms: Causes the filter to pass through input terms that are shorter than the minimum gram size. * keepLongTerms: Causes the filter to pass through input terms that are longer than the maximum gram size. I'm not entirely sure about the usefulness of keepLongTerms, but the ability to pass through short terms would certainly be neat for queries where you'd like to match ALL tokens as either prefixes or exact terms, but some query tokens are shorter than the minimum gram size. As far as I understand, a second field containing the exact terms isn't really a viable alternative there, because you can easily run into situations where only a subset of the query tokens matches in either field.