[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16465152#comment-16465152
 ] 

Robert Muir commented on LUCENE-7960:
-

Again I want to re-emphasize that anything more complex than a single boolean 
"preserveOriginal" is too much. If someone wants to remove too-short or 
too-long terms they can use LengthFilter for that. There is no need to have 
such complex stuff i the ngram filters itself.

Furthermore I still think we need to address the traps I mentioned about about 
these filters emitting too many tokens already before we then go and add an 
option to make them produce even more...

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-04 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463930#comment-16463930
 ] 

Shawn Heisey commented on LUCENE-7960:
--

On first blush, an enum seems even more of a mess than one or two extra boolean 
parameters.  I will let [~rcmuir] and others with more experience in these 
matters make the call on that.  How would the factory (and by extension, text 
configuration files) handle it?

If two boolean parameters is going to meet with resistance, I can support 
preserveOriginal.  I think users will want long and short handled separately, 
but one flag would get the job done.


> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-03 Thread Ingomar Wesp (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461976#comment-16461976
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

I understand your concern. As far as I can tell, there are two options:

1) Replace the booleans with an enum that covers the four possible 
combinations. Maybe "keepMode" with values "DROP", "KEEP_SHORT_TERM", 
"KEEP_LONG_TERM", "KEEP_ALL". I'm not really happy with the names, though - 
advice welcome.

2) Fold the two booleans into one - maybe "preserveOriginal", akin to how the 
corresponding attribute in other filters is called.

I personally prefer 1), but I'd happily adapt the patch to implement the other 
if it makes things easier from your perspective.



> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-02 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461792#comment-16461792
 ] 

Shawn Heisey commented on LUCENE-7960:
--

That idea had nothing to do with the number of booleans.  Only with making any 
extra arguments (no matter how many there are) optional.

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461782#comment-16461782
 ] 

Robert Muir commented on LUCENE-7960:
-

Sorry, varargs are completely uncalled for here. Arguing for 250 booleans 
instead of just 1 boolean isn't going to work as a "negotiating" strategy to 
get back to 2. Please take my recommendations seriously.

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-02 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461777#comment-16461777
 ] 

Shawn Heisey commented on LUCENE-7960:
--

The one thing that I do not know is whether an added argument with the ellipsis 
notation preserves API compatibility.  If we did that, would a program 
originally compiled against an older Lucene version still work correctly with 
the added parameter?

I know that everything would be fine if the program were re-compiled.  Which I 
think technically meets our overall goal for a minor release, but preserving 
binary compatibility when possible is a good bonus.

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-02 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461767#comment-16461767
 ] 

Shawn Heisey commented on LUCENE-7960:
--

An example of where I used the ellipsis notation in my own code to make a 
boolean argument optional:

{code:java}
/**
 * Fully close a connection, statement, and result set, ignoring any 
errors
 * that occur. Any of the three resources here can be null, but at 
least one
 * of them must NOT be null.
 * 
 * @param rs the ResultSet to close.
 * @param st the Statement to close.
 * @param conn The Connection to close.
 * @param forceFlags This odd ellipsis parameter is used for one thing
 *currently: a flag to indicate whether or not a close will 
be
 *forced on all provided resources even if everything 
doesn't
 *match up. In theory, a statement derived from the 
resultset
 *and connections derived from either one should be exactly 
the
 *same object as the ones provided to the method. If the 
flag is
 *false, then only the first non-null resource provided and 
any
 *parent resources derived from that resource will be 
closed. If
 *it is true, ALL resources including derived resources 
will be
 *closed. Mismatches will be logged either way. The ellipsis
 *notation is so that this parameter is optional. If 
omitted, it
 *will default to false.
 * @throws IllegalArgumentException if all three resources are null.
 */
public static void fullQuietClose(ResultSet rs, Statement st, 
Connection conn,
boolean... forceFlags)
{code}

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-02 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461757#comment-16461757
 ] 

Shawn Heisey commented on LUCENE-7960:
--

I just thought of a particularly ugly idea that would preserve the current 
3-arg capability *and* allow the extra booleans.  Make the constructor 
signature this:

{code}
  public EdgeNGramTokenFilter(
  TokenStream input, int minGram, int maxGram, boolean... flags)) {
{code}

I sometimes do things like this in my own code with methods that nobody else is 
going to use.  But for a public API like Lucene, is that as bad an idea as it 
seems?


> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-02 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461700#comment-16461700
 ] 

Shawn Heisey commented on LUCENE-7960:
--

The "obvious" workaround to either situation is to decrease minGram and/or 
increase maxGram.  I find that increasing maxGram doesn't meet with a lot of 
resistance ... but decreasing minGram can lead to massive term explosion (with 
possible performance ramifications) and a big shift in recall/precision balance.

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-02 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461695#comment-16461695
 ] 

Shawn Heisey commented on LUCENE-7960:
--

My original idea would have been handled by one boolean -- keeping terms 
shorter than minGram.  On more than one occasion, I've fielded questions where 
it turns out the user is trying to search for terms shorter than their minGram 
size.

In discussing it, the notion of *long* terms being removed by the min/max range 
also came up.  It was an idea I had not originally considered, but I have 
encountered someone since where they had ngram on the index side but not the 
query side, and wanted to search for terms longer than their maxGram size.

It could be reduced to one "keep" boolean to keep both short and long terms, 
but I think we're going to have people who want to keep short terms but not 
long terms, and vice versa.


> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461683#comment-16461683
 ] 

Robert Muir commented on LUCENE-7960:
-

my biggest concern is that these filters would then have two ctors:

* NGramTokenFilter(TokenStream)
* NGramTokenFilter(TokenStream, int, int, boolean, boolean)

The no-arg one starts looking more attractive to users at this point, and its 
mega-trappy (n=1,2)!!! That's the ctor that should be deprecated :)

In general I'll be honest, I don't like how trappy the apis are with these 
filters/tokenizers because of defaults like that. I also think its trappy they 
take a min and a max at all, because that's really creating (max-min) indexed 
fields all unioned into one. There aren't even any warnings about this. 

I haven't reviewed what the booleans of the patch does, but I am concerned that 
the use case may just be "keep original" which could be one boolean, or perhaps 
done in a different way entirely (e.g. KeywordRepeatFilter or perhaps something 
like LUCENE-8273). So if its acceptable to collapse it into one boolean that 
does that, I think that would be easier.

I feel like any defaults that our apis lead to (and when you have multiple 
ctors, then thats a default) should be something that will perform and scale 
well and work for the general case. For example n=4 has been shown to work well 
in many relevance experiments. At least we should make it easy for you to 
explicitly ask for something like that without passing many parameters.


> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-02 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461673#comment-16461673
 ] 

Shawn Heisey commented on LUCENE-7960:
--

Updated patch.  Does not deprecate constructors, does not fiddle with 
constructor usage in non-test code.

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-02 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461660#comment-16461660
 ] 

Shawn Heisey commented on LUCENE-7960:
--

[~rcmuir] so you would keep the current constructor around permanently?  I have 
no objection if that's what you'd prefer.

[~iwesp] You did preserve it.  When I looked at the patch (not the result of 
applying the patch) it looked like a replacement, which prompted that comment.


> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461514#comment-16461514
 ] 

Robert Muir commented on LUCENE-7960:
-

The patch doesn't add up to me. The description of this issue claims that the 
default behavior wouldn't be changed, but then the patch does just the opposite 
and makes the new parameters mandatory. 5 arguments is too many here, that's 
not usable IMO.

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-02 Thread Ingomar Wesp (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461325#comment-16461325
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

Thanks a lot for your support. I don't quite understand your comment regarding 
the constructors: Unless I'm missing something, I think I _did_ preserve the 
original ones, which now delegate to the new ctors using defaullt values.

Is there anything left that I can or should do to get this into master?

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-01 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460312#comment-16460312
 ] 

Shawn Heisey commented on LUCENE-7960:
--

Updated patch added.  Deprecates the existing 3-arg constructors, and removes 
all usage of the deprecated constructors from the codebase.  Tests in 
lucene/analysis and precommit at the root are passing.


> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-01 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460304#comment-16460304
 ] 

Shawn Heisey commented on LUCENE-7960:
--

Applying the PR as-is does seem to work.  All the tests are passing.  I'm 
working on some minor alterations.  I've got precommit running, so far it looks 
good.

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-01 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460226#comment-16460226
 ] 

Shawn Heisey commented on LUCENE-7960:
--

I have basically come to the conclusion that I have absolutely no idea how this 
stuff works and cannot make any sense out of what the patch does, or even what 
the classes are doing *before* the modifications.

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-01 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459873#comment-16459873
 ] 

Shawn Heisey commented on LUCENE-7960:
--

I've gotten a look at the PR.

Changing the signature on an existing constructor isn't a good idea.  Lucene is 
a public API and there will be user code using that constructor that must 
continue to work if Lucene is upgraded.  We should add a new constructor and 
have the existing constructor(s) call that one with default values.

The only question about that is whether the existing constructor should be 
deprecated in stable and removed in master.  I'm not sure who to ask.

There are some variable renames.  They don't look like problems, especially 
because the visibility is private, but I'd like to get the opinion of someone 
who has deeper Lucene knowledge.

I'm having a difficult time following the modifications to the filter logic.  
Some of the modifications look like they're not directly related to 
implementing this issue, but I can't tell for sure.


> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-04-25 Thread Ingomar Wesp (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453005#comment-16453005
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

I've just updated the patch in PR #362. I now also have a working patch for 
NGramTokenizer and EdgeNGramTokenizer in a separate branch, but it's still 
pretty messy and thus not yet ready for a PR.

Since this issue deals with the filters specifically, could someone could have 
a look at PR #362 and merge it if it's acceptable? Once this is done, I would 
then open another issue for the tokenizers.

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-04-04 Thread Ingomar Wesp (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16426221#comment-16426221
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

Ok, I just added the same paramteters to the NGramTokenFilter and updated the 
pull request. In the long term, it probably makes sense to move all the logic 
into the NGramTokenFilter and turn EdgeNGramTokenFilter into a simple wrapper. 
EdgeNGramTokenizer is already implemented this way.

I presume it also makes sense to extend NGramTokenizer and EdgeNGramTokenizer 
accordingly?

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-04-01 Thread Ingomar Wesp (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421621#comment-16421621
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

Thanks for your feedback! Yes, this is exactly the behavior I tried to 
implement. I'll rename the parameters according to your suggestion.

I will also have a look at the other ngram components as soon as I have time.

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-03-31 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421479#comment-16421479
 ] 

Shawn Heisey commented on LUCENE-7960:
--

When I created this issue, I didn't think about long terms.  Somebody probably 
needs the functionality.

On further reflection, I don't think that new parameter names should be plural. 
 Using "keepShortTerm" and "keepLongTerm" sounds better to me.  They could both 
be enabled if that's what the user wants.  The same options should be added to 
all ngram analysis components, not just EdgeNgramFilter.

Let's consider a min length of 4 and a max length of 6, using EdgeNgramFilter.

If the input term is "abcdefgh", here's the basic term list from the filter:

abcd abcde abcdef

If keepShortTerm is enabled with the longer input, there is no change.  If 
keepLongTerm is enabled with the longer input, then the term list will be:

abcd abcde abcdef abcdefgh

The seven-character string would not be created.  If that's what the user 
wants, they should just increase the max value, rather than enable the new 
option.

If the input term is "ab", then the filter would not normally produce any 
terms.  With keepShortTerm, the output would be the input -- "ab".  The 
three-character term would not be produced.  If the user wants that, they would 
need to reduce the min value.

The keepLongTerm option would have no effect with a short input.  

I did glance at the patch, but didn't examine it in detail, so I don't know if 
it does what I just described or not.

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-03-31 Thread Ingomar Wesp (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421475#comment-16421475
 ] 

Ingomar Wesp commented on LUCENE-7960:
--

I'd like to propose a patch (see attached pull request #349) that adds two 
options to the EdgeNGramFilter:
 * keepShortTerms: Causes the filter to pass through input terms that are 
shorter than the minimum gram size.
 * keepLongTerms: Causes the filter to pass through input terms that are longer 
than the maximum gram size.

I'm not entirely sure about the usefulness of keepLongTerms, but enabling the 
ability pass through short terms would certainly be neat for queries where 
you'd like to match ALL tokens as either prefixes or exact terms, but some 
query tokens are shorter than the minimum gram size. As far is I understand, a 
second field containing the exact terms isn't really a viable alternative 
there, because you can easily run into situations where only a subset of query 
tokens matches for either field.

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org