[jira] [Commented] (LUCENE-8265) WordDelimiterFilter should pass through terms marked as keywords

2018-04-28 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457626#comment-16457626
 ] 

ASF subversion and git services commented on LUCENE-8265:
-

Commit d1bf6ad79a862b48c20d3197d58e6b5eefde519c in lucene-solr's branch 
refs/heads/branch_7x from Michael Sokolov
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d1bf6ad ]

LUCENE-8265: WordDelimiter*Filter ignores keywords


> WordDelimiterFilter should pass through terms marked as keywords
> ----------------------------------------------------------------
>
>                 Key: LUCENE-8265
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8265
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> This will help in cases where some terms containing separator characters 
> should be split, but others should not.  For example, this will enable a 
> filter that recognizes things that look like fractions and marks them as 
> keywords so that 1/2 does not become 12, while still splitting and joining 
> terms that look like part numbers containing slashes, e.g. something like 
> "sn-999123/1" might sometimes be written "sn-999123-1".
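
As a rough illustration of the behavior requested above, here is a minimal,
self-contained Java sketch. It does not use Lucene: the class DelimiterSketch
and its process method are hypothetical, and the splitting rule (split on '/'
and '-', then also emit the catenated form) is a deliberate simplification of
what WordDelimiterFilter does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the filter; not Lucene code.
final class DelimiterSketch {

    // Returns the terms produced for one input term. Keyword terms pass
    // through untouched; other terms are split on '/' and '-' and, if more
    // than one part results, the catenated form is emitted as well.
    static List<String> process(String term, boolean isKeyword) {
        List<String> out = new ArrayList<>();
        if (isKeyword) {
            out.add(term);
            return out;
        }
        StringBuilder joined = new StringBuilder();
        for (String part : term.split("[/-]")) {
            if (!part.isEmpty()) {
                out.add(part);
                joined.append(part);
            }
        }
        if (out.size() > 1) {
            out.add(joined.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(process("1/2", true));          // [1/2]
        System.out.println(process("1/2", false));         // [1, 2, 12]
        System.out.println(process("sn-999123/1", false)); // [sn, 999123, 1, sn9991231]
        System.out.println(process("sn-999123-1", false)); // [sn, 999123, 1, sn9991231]
    }
}
```

In this sketch both spellings of the part number yield the same terms, while
the fraction survives intact once something upstream has marked it as a
keyword.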



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8265) WordDelimiterFilter should pass through terms marked as keywords

2018-04-28 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457620#comment-16457620
 ] 

ASF subversion and git services commented on LUCENE-8265:
-

Commit fc0878cc2f97fdaa5206796ca5e0efa4988e7609 in lucene-solr's branch 
refs/heads/master from Michael Sokolov
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=fc0878c ]

LUCENE-8265: WordDelimiter*Filter ignores keywords





[jira] [Commented] (LUCENE-8265) WordDelimiterFilter should pass through terms marked as keywords

2018-04-24 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449981#comment-16449981
 ] 

Mike Sokolov commented on LUCENE-8265:
--

[~romseygeek] yes, I could use that!  




[jira] [Commented] (LUCENE-8265) WordDelimiterFilter should pass through terms marked as keywords

2018-04-24 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449896#comment-16449896
 ] 

Alan Woodward commented on LUCENE-8265:
---

I created LUCENE-8273 for the potential spinoff - [~sokolov] would this work 
for your situation?




[jira] [Commented] (LUCENE-8265) WordDelimiterFilter should pass through terms marked as keywords

2018-04-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449819#comment-16449819
 ] 

Michael McCandless commented on LUCENE-8265:


Thanks [~sokolov]; new PR looks great.




[jira] [Commented] (LUCENE-8265) WordDelimiterFilter should pass through terms marked as keywords

2018-04-24 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449744#comment-16449744
 ] 

Mike Sokolov commented on LUCENE-8265:
--

I updated the pull request, adding a new flag, IGNORE_KEYWORDS, that gates
this feature.
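
A minimal sketch of what gating on such a flag could look like, assuming the
usual bit-flag style of configuration. The class FlagGateSketch, the helper
names and the numeric flag values are illustrative assumptions, not the
values or code from the pull request.

```java
// Hypothetical bit-flag gate; the constants are made-up values for
// illustration and are not taken from Lucene.
final class FlagGateSketch {
    static final int CATENATE_ALL = 1;
    static final int IGNORE_KEYWORDS = 2;

    // True if the given flag bit is set in the configuration flags.
    static boolean has(int flags, int flag) {
        return (flags & flag) != 0;
    }

    // A term bypasses the filter only when the option is enabled AND the
    // term is marked as a keyword, so configurations without the flag keep
    // the previous behavior.
    static boolean shouldPassThrough(int flags, boolean isKeyword) {
        return has(flags, IGNORE_KEYWORDS) && isKeyword;
    }

    public static void main(String[] args) {
        System.out.println(shouldPassThrough(IGNORE_KEYWORDS, true));
        System.out.println(shouldPassThrough(CATENATE_ALL, true));
        System.out.println(shouldPassThrough(IGNORE_KEYWORDS | CATENATE_ALL, false));
    }
}
```

Because the check is opt-in, analysis chains that mark keywords only to
protect them from a later stemmer would be unaffected unless they set the
flag.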

On Mon, Apr 23, 2018 at 11:52 AM, David Smiley (JIRA) 






[jira] [Commented] (LUCENE-8265) WordDelimiterFilter should pass through terms marked as keywords

2018-04-23 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448339#comment-16448339
 ] 

David Smiley commented on LUCENE-8265:
--

bq.  code up a TokenFilter that wraps another TokenFilter, and bypasses the 
wrapped filter if a certain condition is met?

Yes; this has been on my very long Lucene/Solr idea TODO list.   Or perhaps 
alternatively, some TokenFilters could extend a new TokenFilter subclass that 
checks a condition.   By default it could be a Predicate that simply returns 
true.  This would address Mike Sokolov's concern on propagating the lifecycle 
calls... I've had to delegate a tokenizer/filter before and it was a bit 
annoying to get right.
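
The subclass idea could be sketched roughly as follows. This is plain Java
rather than Lucene code: ConditionalFilterSketch and UpperCaseSketch are
hypothetical names, and terms are modeled as plain strings instead of
attribute-based tokens, so it only shows the shape of the design.

```java
import java.util.Locale;
import java.util.function.Predicate;

// Hypothetical base class: subclasses implement apply(), and the base class
// decides per term whether to call it.
abstract class ConditionalFilterSketch {
    private final Predicate<String> shouldApply;

    // Default condition: always apply the filter.
    protected ConditionalFilterSketch() {
        this(term -> true);
    }

    protected ConditionalFilterSketch(Predicate<String> shouldApply) {
        this.shouldApply = shouldApply;
    }

    // Applies the filter logic only when the condition holds; otherwise the
    // term is returned unchanged.
    final String filter(String term) {
        return shouldApply.test(term) ? apply(term) : term;
    }

    protected abstract String apply(String term);
}

// Example subclass used only to exercise the base class.
final class UpperCaseSketch extends ConditionalFilterSketch {
    UpperCaseSketch() {
        super();
    }

    UpperCaseSketch(Predicate<String> shouldApply) {
        super(shouldApply);
    }

    @Override
    protected String apply(String term) {
        return term.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        ConditionalFilterSketch skipSlashes = new UpperCaseSketch(t -> !t.contains("/"));
        System.out.println(skipSlashes.filter("abc")); // ABC
        System.out.println(skipSlashes.filter("a/b")); // a/b
    }
}
```

Since the condition lives in the same object as the filter logic, there is
no second filter to delegate lifecycle calls to.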




[jira] [Commented] (LUCENE-8265) WordDelimiterFilter should pass through terms marked as keywords

2018-04-23 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448257#comment-16448257
 ] 

Mike Sokolov commented on LUCENE-8265:
--

Good point [~khitrin]. I agree, we should add an option. I'll post a change 
later today I hope.

[~romseygeek] that idea is really cool, but I think it is more complexity than 
I want to take on for this issue? I have tried making a wrapping filter before, 
and it is pretty tricky. In my experience you have to be very careful about (1) 
how reset() calls propagate, and (2) the signal to switch behavior. E.g., you 
probably want to be able to call incrementToken() on either of two different 
upstream filters that both share the same input, based on the value of some 
attribute. However, by the time you see the attribute, it is already too late 
to change! So you have to introduce a delay token that only carries this 
"switching" info. 

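The timing problem described above can be illustrated with a small sketch.
Note that it swaps in a one-token lookahead buffer rather than the separate
"delay token" mentioned in the comment, and it is plain Java, not Lucene:
LookaheadSourceSketch, RouterSketch and the fraction regex are all
hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical shared input with a one-token lookahead buffer.
final class LookaheadSourceSketch {
    private final List<String> tokens;
    private int pos;
    private String buffered;

    LookaheadSourceSketch(List<String> tokens) {
        this.tokens = tokens;
    }

    // reset() must clear the buffer too, or a stale token leaks into the
    // next pass.
    void reset() {
        pos = 0;
        buffered = null;
    }

    // Returns the next token without consuming it, or null at end of input.
    String peek() {
        if (buffered == null && pos < tokens.size()) {
            buffered = tokens.get(pos++);
        }
        return buffered;
    }

    // Consumes and returns the next token, or null at end of input.
    String next() {
        String token = peek();
        buffered = null;
        return token;
    }
}

final class RouterSketch {
    // Decides on the peeked token BEFORE consuming it: fraction-like terms
    // bypass the (toy) delimiter handling, everything else has '/' removed.
    static List<String> route(LookaheadSourceSketch source) {
        List<String> out = new ArrayList<>();
        while (source.peek() != null) {
            boolean bypass = source.peek().matches("\\d+/\\d+");
            String term = source.next();
            out.add(bypass ? term : term.replace("/", ""));
        }
        return out;
    }

    public static void main(String[] args) {
        LookaheadSourceSketch source =
            new LookaheadSourceSketch(Arrays.asList("1/2", "a/b"));
        System.out.println(route(source)); // [1/2, ab]
        source.reset();
        System.out.println(route(source)); // [1/2, ab]
    }
}
```

Even in this toy form, the buffer is extra state that reset() has to know
about, which is the kind of bookkeeping that makes a wrapping filter tricky.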



[jira] [Commented] (LUCENE-8265) WordDelimiterFilter should pass through terms marked as keywords

2018-04-23 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448145#comment-16448145
 ] 

Alan Woodward commented on LUCENE-8265:
---

I wonder if there's a better way of handling this than using the 
KeywordAttribute, which as Nikolay says is heavily overloaded.  Would it be 
possible to somehow code up a TokenFilter that wraps another TokenFilter, and 
bypasses the wrapped filter if a certain condition is met?




[jira] [Commented] (LUCENE-8265) WordDelimiterFilter should pass through terms marked as keywords

2018-04-23 Thread Nikolay Khitrin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448135#comment-16448135
 ] 

Nikolay Khitrin commented on LUCENE-8265:
-

This is a breaking change.

For example, the keyword attribute can be used to bypass stemming (as mentioned 
in the KeywordAttribute javadoc) _after_ WordDelimiterFilter.

It should at least be marked as breaking in the changelog. A better solution 
might be to provide this as an option for the delimiter filter.




[jira] [Commented] (LUCENE-8265) WordDelimiterFilter should pass through terms marked as keywords

2018-04-23 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447980#comment-16447980
 ] 

Michael McCandless commented on LUCENE-8265:


Thanks [~sokolov]; the PR looks great; I'll wait a day or so and then push!




[jira] [Commented] (LUCENE-8265) WordDelimiterFilter should pass through terms marked as keywords

2018-04-22 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447374#comment-16447374
 ] 

Mike Sokolov commented on LUCENE-8265:
--

Pull request here: https://github.com/apache/lucene-solr/pull/359 .. I haven't 
used this workflow much, so please let me know if there is a way to make it 
easier for committers to check out this change.
