[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-07-15 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709450#comment-13709450
 ] 

Jack Krupansky commented on LUCENE-4766:


I just happened to notice that the underlying token filter accepts a list of 
patterns, but the factory only accepts a single pattern.

Was this intentional or an oversight?

In fact, the main example for the filter requires multiple patterns, which the 
factory does not support.


 Pattern token filter which emits a token for every capturing group
 --

 Key: LUCENE-4766
 URL: https://issues.apache.org/jira/browse/LUCENE-4766
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Clinton Gormley
Assignee: Simon Willnauer
Priority: Minor
  Labels: analysis, feature, lucene
 Fix For: 5.0, 4.4

 Attachments: LUCENE-4766.patch, LUCENE-4766.patch, LUCENE-4766.patch, 
 LUCENE-4766.patch, LUCENE-4766.patch


 The PatternTokenizer either splits on matches or lets you specify a single 
 capture group.  This is insufficient for my needs. Quite often I want to 
 capture multiple overlapping tokens in the same position.
 I've written a pattern token filter which accepts multiple patterns and emits 
 tokens for every capturing group that is matched in any pattern.
 Patterns are not anchored to the beginning and end of the string, so each 
 pattern can produce multiple matches.
 For instance, a pattern like:
 {code}
 (([a-z]+)(\d*))
 {code}
 when matched against: 
 {code}
 abc123def456
 {code}
 would produce the tokens:
 {code}
 abc123, abc, 123, def456, def, 456
 {code}
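The behaviour described above can be approximated with plain java.util.regex. This is only a sketch, not the filter from the patch: the class and method names are hypothetical, and skipping empty or non-participating groups is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupDemo {
    // Returns the text of every non-empty capturing group for every match
    // of the pattern in the input, in match order.
    static List<String> tokens(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) { // unanchored: keep scanning for further matches
            for (int g = 1; g <= m.groupCount(); g++) {
                String text = m.group(g);
                // Assumption: groups that did not participate or matched "" are skipped.
                if (text != null && !text.isEmpty()) {
                    out.add(text);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [abc123, abc, 123, def456, def, 456]
        System.out.println(tokens("(([a-z]+)(\\d*))", "abc123def456"));
    }
}
```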
 Multiple patterns can be applied, e.g. these patterns could be used for 
 camelCase analysis:
 {code}
 ([A-Z]{2,}),
 (?<![A-Z])([A-Z][a-z]+),
 (?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+),
 ([0-9]+)
 {code}
 When matched against the string letsPartyLIKEits1999_dude, they would 
 produce the tokens:
 {code}
 lets, Party, LIKE, its, 1999, dude
 {code}
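A minimal sketch of the camelCase example, again using only java.util.regex. It assumes the lookarounds in the second and third patterns are lookbehinds (`(?<!...)` and `(?<=...)`) and that the patterns are written as Java string literals, so `\\b` denotes the regex `\b`. Sorting the captures by start offset is also an assumption made here to reproduce the order shown above; the class and helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CamelCaseDemo {
    // A captured group together with its start offset in the input.
    static final class Hit {
        final int start;
        final String text;
        Hit(int start, String text) { this.start = start; this.text = text; }
    }

    static final String[] PATTERNS = {
        "([A-Z]{2,})",                                  // runs of capitals: LIKE
        "(?<![A-Z])([A-Z][a-z]+)",                      // capitalised words: Party
        "(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",  // lowercase words: lets, its, dude
        "([0-9]+)"                                      // digit runs: 1999
    };

    // Applies every pattern to the same input and returns all captures,
    // ordered by where they start in the input.
    static List<String> tokens(String input) {
        List<Hit> hits = new ArrayList<>();
        for (String regex : PATTERNS) {
            Matcher m = Pattern.compile(regex).matcher(input);
            while (m.find()) {
                for (int g = 1; g <= m.groupCount(); g++) {
                    String text = m.group(g);
                    if (text != null && !text.isEmpty()) {
                        hits.add(new Hit(m.start(g), text));
                    }
                }
            }
        }
        hits.sort(Comparator.comparingInt(h -> h.start));
        List<String> out = new ArrayList<>();
        for (Hit h : hits) {
            out.add(h.text);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [lets, Party, LIKE, its, 1999, dude]
        System.out.println(tokens("letsPartyLIKEits1999_dude"));
    }
}
```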
 If no token is emitted, the original token is preserved. 
 If the preserveOriginal flag is true, it will output the full original token 
 (i.e. letsPartyLIKEits1999_dude) in addition to any matching tokens (but in 
 this case, if a matching token is identical to the original, it will only 
 emit one copy of the full token).
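The two rules above (keep the original when nothing is emitted, and avoid a duplicate of the full token when preserveOriginal is set) could be modelled roughly as follows. This is a plain-Java sketch with hypothetical names, not the code from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreserveOriginalDemo {
    static List<String> filter(String token, List<Pattern> patterns, boolean preserveOriginal) {
        List<String> out = new ArrayList<>();
        if (preserveOriginal) {
            out.add(token); // the full original token always comes first
        }
        for (Pattern p : patterns) {
            Matcher m = p.matcher(token);
            while (m.find()) {
                for (int g = 1; g <= m.groupCount(); g++) {
                    String text = m.group(g);
                    if (text == null || text.isEmpty()) {
                        continue; // assumption: empty captures are not emitted
                    }
                    if (preserveOriginal && text.equals(token)) {
                        continue; // already emitted once as the original
                    }
                    out.add(text);
                }
            }
        }
        if (out.isEmpty()) {
            out.add(token); // nothing was emitted: keep the original token
        }
        return out;
    }
}
```

For example, with the single pattern `(([a-z]+)(\d*))` and the token abc123, this sketch returns abc123, abc, 123 whether or not preserveOriginal is set, and a token such as `!!!` that matches nothing is passed through unchanged.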
 Multiple patterns are required to allow overlapping captures, but this also 
 means that patterns are less dense and easier to understand.
 This is my first Java code, so apologies if I'm doing something stupid.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-04-24 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640292#comment-13640292
 ] 

Adrien Grand commented on LUCENE-4766:
--

+1

 Pattern token filter which emits a token for every capturing group
 --

 Key: LUCENE-4766
 URL: https://issues.apache.org/jira/browse/LUCENE-4766
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Clinton Gormley
Assignee: Simon Willnauer
Priority: Minor
  Labels: analysis, feature, lucene
 Fix For: 4.3

 Attachments: LUCENE-4766.patch, LUCENE-4766.patch, LUCENE-4766.patch, 
 LUCENE-4766.patch, LUCENE-4766.patch





[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-04-24 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640314#comment-13640314
 ] 

Commit Tag Bot commented on LUCENE-4766:


[trunk commit] simonw
http://svn.apache.org/viewvc?view=revision&revision=1471347

LUCENE-4766: Added a PatternCaptureGroupTokenFilter that uses Java regexes to 
emit multiple tokens one for each capture group in one or more patterns




[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-04-24 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640318#comment-13640318
 ] 

Commit Tag Bot commented on LUCENE-4766:


[branch_4x commit] simonw
http://svn.apache.org/viewvc?view=revision&revision=1471352

LUCENE-4766: Added a PatternCaptureGroupTokenFilter that uses Java regexes to 
emit multiple tokens one for each capture group in one or more patterns




[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-13 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13577469#comment-13577469
 ] 

Adrien Grand commented on LUCENE-4766:
--

bq. I tend to disagree that this should be a tokenizer.

Maybe another option is to change this filter so that it doesn't change 
offsets? Suppose this filter is used to break {{TokenFilter}} into 
{{Token}}, {{TokenFilter}} and {{Filter}}: I think it's acceptable to highlight 
{{TokenFilter}} as a whole when searching for {{Token}} or {{Filter}}.

 Pattern token filter which emits a token for every capturing group
 --

 Key: LUCENE-4766
 URL: https://issues.apache.org/jira/browse/LUCENE-4766
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Clinton Gormley
Assignee: Simon Willnauer
Priority: Minor
  Labels: analysis, feature, lucene
 Fix For: 4.2

 Attachments: LUCENE-4766.patch, LUCENE-4766.patch





[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13577512#comment-13577512
 ] 

Robert Muir commented on LUCENE-4766:
-

The patch should add a test that uses checkRandomData.

It would find bugs like this:
{code}
charOffsetStart = offsetAttr.startOffset();
charOffsetEnd = charOffsetStart + spare.length;
{code}

charOffsetEnd should be offsetAttr.endOffset(). If there is a CharFilter in the 
chain, the term length need not equal the length of the original text span, so 
the current calculation will be incorrect.
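
To illustrate the point, here is a small plain-Java model rather than Lucene code. The `correctOffset` method below is a hypothetical stand-in for a char filter's offset correction, hard-coded for the case where the original text `a&amp;b` has been rewritten to `a&b` before tokenization.

```java
class OffsetDemo {
    // Hypothetical offset correction for the filtered text "a&b", which came
    // from the original text "a&amp;b": everything after the '&' sits 4
    // characters further along in the original.
    static int correctOffset(int filteredOffset) {
        return filteredOffset <= 1 ? filteredOffset : filteredOffset + 4;
    }

    // The calculation from the snippet above: start offset plus term length.
    static int naiveEnd(String term, int startOffset) {
        return startOffset + term.length();
    }

    public static void main(String[] args) {
        String term = "a&b";
        int startOffset = correctOffset(0);          // 0
        int endOffset = correctOffset(term.length()); // 7, end of "a&amp;b"
        // prints endOffset=7 naiveEnd=3
        System.out.println("endOffset=" + endOffset
            + " naiveEnd=" + naiveEnd(term, startOffset));
    }
}
```

In this model the token covers all seven characters of the original text, while start plus term length stops at 3, which is why reusing the incoming end offset is the safer choice.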

 Pattern token filter which emits a token for every capturing group
 --

 Key: LUCENE-4766
 URL: https://issues.apache.org/jira/browse/LUCENE-4766
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Clinton Gormley
Assignee: Simon Willnauer
Priority: Minor
  Labels: analysis, feature, lucene
 Fix For: 4.2

 Attachments: LUCENE-4766.patch, LUCENE-4766.patch, LUCENE-4766.patch





[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-13 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13577514#comment-13577514
 ] 

Simon Willnauer commented on LUCENE-4766:
-

Clinton, I think you can remove the offset attribute reference in there 
entirely; just don't touch the offsets at all. Can you also call 
BaseTokenStreamTestCase#assertAnalyzesToReuse and 
BaseTokenStreamTestCase#checkRandomData in your tests, please?




[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13577556#comment-13577556
 ] 

Uwe Schindler commented on LUCENE-4766:
---

bq. Clinton, I think you can trash the offset attribute reference in there 
entirely just don't mess with them at all.

That's part of a bigger problem in the current code. The idea of this filter is 
to produce multiple output tokens from one input token. To make this work 
correctly, the *new* output tokens must be based on the original token. That 
means the filter must reset each newly produced token to a clean state; 
otherwise unrelated and unknown attributes may stay alive with wrong values, 
especially if later TokenFilters change attributes (e.g. a SynonymFilter 
inserting more synonyms). The problem Clinton had was that he had to re-set the 
offset attribute (although he does not change it), but he missed possible other 
attributes on the stream that he does not know about.

If you look at other filters doing similar things, like SynonymFilter or WDF, 
the way it has to work is this:
- The first token emitted is the original one, maybe modified.
- All inserted tokens are cloned from the original (first) token; use 
captureState/restoreState to do that. This initializes the attribute source 
to exactly the same token as the original (unmodified) one. After calling 
restoreState, you can *modify* the attributes (like the term text) and call 
setPositionIncrement(0). You can then leave the offset (and other unknown 
attributes that may be on the token stream) unchanged; don't reference them at 
all.
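
The clone-then-modify approach can be modelled in plain Java as below. This is a sketch only and does not use Lucene's API: a token is represented as a map of attribute names to values, copying the map stands in for captureState/restoreState, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CloneTokenDemo {
    // Emits the original token first, then one clone per extra term.
    static List<Map<String, Object>> expand(Map<String, Object> original,
                                            List<String> extraTerms) {
        List<Map<String, Object>> out = new ArrayList<>();
        out.add(original);
        for (String term : extraTerms) {
            // Stand-in for restoreState: start from an exact copy of the
            // original, including attributes this code knows nothing about.
            Map<String, Object> clone = new HashMap<>(original);
            clone.put("term", term);           // modify only the term text...
            clone.put("positionIncrement", 0); // ...and stack it on the same position
            out.add(clone);                    // offsets and other attributes untouched
        }
        return out;
    }
}
```

With an original token whose term is TokenFilter and whose offsets are 0 and 11, expanding it with the terms Token and Filter yields three tokens that all carry offsets 0 and 11, and any attribute the code does not know about is copied along unchanged.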

 Pattern token filter which emits a token for every capturing group
 --

 Key: LUCENE-4766
 URL: https://issues.apache.org/jira/browse/LUCENE-4766
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Clinton Gormley
Assignee: Simon Willnauer
Priority: Minor
  Labels: analysis, feature, lucene
 Fix For: 4.2

 Attachments: LUCENE-4766.patch, LUCENE-4766.patch, LUCENE-4766.patch, 
 LUCENE-4766.patch





[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-11 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575714#comment-13575714
 ] 

Simon Willnauer commented on LUCENE-4766:
-

Hey Clinton, this looks very interesting and, given this is your first Java 
experience, pretty impressive too. I am not sure how expensive this filter will 
be in practice, but given that you can do stuff you can't do with any of the 
other filters, I think folks just have to pay the price here. I like that all 
patterns operate on the same CharSequence and that you are setting offsets 
right. Cool stuff! 

 Pattern token filter which emits a token for every capturing group
 --

 Key: LUCENE-4766
 URL: https://issues.apache.org/jira/browse/LUCENE-4766
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Clinton Gormley
Assignee: Simon Willnauer
Priority: Minor
  Labels: analysis, feature, lucene
 Fix For: 4.2

 Attachments: LUCENE-4766.patch





[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-11 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575718#comment-13575718
 ] 

Adrien Grand commented on LUCENE-4766:
--

bq. I like that all patterns operate on the same CharSequence and that you are 
setting offsets right.

Does this filter need to set offsets? I'm worried that under certain 
circumstances filters that modify offsets might create inconsistent offset 
graphs, because they don't know what filters have been applied before. (There 
is an exclusion list for filters that modify offsets in TestRandomChains.)




[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-11 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575731#comment-13575731
 ] 

Simon Willnauer commented on LUCENE-4766:
-

bq. Does this filter need to set offsets? I'm worried that under certain 
circumstances filters that modify offsets might create inconsistent offset 
graphs (because they don't know what filters have been applied before, there is 
an exclusion list for filters that modify offsets in TestRandomChains).

Yeah, I agree offsets are tricky here. I just wonder if we should really 
restrict our TokenFilters from fixing offsets? Kind of an odd thing, though. 
What should a TokenFilter like this do instead?

 Pattern token filter which emits a token for every capturing group
 --

 Key: LUCENE-4766
 URL: https://issues.apache.org/jira/browse/LUCENE-4766
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Clinton Gormley
Assignee: Simon Willnauer
Priority: Minor
  Labels: analysis, feature, lucene
 Fix For: 4.2

 Attachments: LUCENE-4766.patch


 The PatternTokenizer either functions by splitting on matches, or allows you 
 to specify a single capture group.  This is insufficient for my needs. Quite 
 often I want to capture multiple overlapping tokens in the same position.
 I've written a pattern token filter which accepts multiple patterns and emits 
 tokens for every capturing group that is matched in any pattern.
 Patterns are not anchored to the beginning and end of the string, so each 
 pattern can produce multiple matches.
 For instance a pattern like :
 {code}
 (([a-z]+)(\d*))
 {code}
 when matched against: 
 {code}
 abc123def456
 {code}
 would produce the tokens:
 {code}
 abc123, abc, 123, def456, def, 456
 {code}
 Multiple patterns can be applied, eg these patterns could be used for 
 camelCase analysis:
 {code}
 ([A-Z]{2,}),
 (?<![A-Z])([A-Z][a-z]+),
 (?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+),
 ([0-9]+)
 {code}
 When matched against the string letsPartyLIKEits1999_dude, they would 
 produce the tokens:
 {code}
 lets, Party, LIKE, its, 1999, dude
 {code}
 If no token is emitted, the original token is preserved. 
 If the preserveOriginal flag is true, it will output the full original token 
 (ie letsPartyLIKEits1999_dude) in addition to any matching tokens (but in 
 this case, if a matching token is identical to the original, it will only 
 emit one copy of the full token).
 Multiple patterns are required to allow overlapping captures, but also means 
 that patterns are less dense and easier to understand.
 This is my first Java code, so apologies if I'm doing something stupid.
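
A minimal sketch of the behaviour described above, using only java.util.regex. This is not the code from the attached patch: the class and method names are hypothetical, and tokens are emitted pattern by pattern rather than in any particular offset order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the filter from the patch.
class CaptureGroupSketch {

    static List<String> captureGroups(String token, boolean preserveOriginal, List<Pattern> patterns) {
        List<String> out = new ArrayList<>();
        if (preserveOriginal) {
            out.add(token);
        }
        for (Pattern pattern : patterns) {
            Matcher m = pattern.matcher(token);
            // find() is unanchored, so one pattern can match several times.
            while (m.find()) {
                for (int g = 1; g <= m.groupCount(); g++) {
                    String text = m.group(g);
                    // Skip groups that did not participate or matched nothing.
                    if (text == null || text.isEmpty()) {
                        continue;
                    }
                    // With preserveOriginal, the full token is emitted only once.
                    if (preserveOriginal && text.equals(token)) {
                        continue;
                    }
                    out.add(text);
                }
            }
        }
        // If no token was emitted, the original token is preserved.
        if (out.isEmpty()) {
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Pattern> patterns = Arrays.asList(Pattern.compile("(([a-z]+)(\\d*))"));
        System.out.println(captureGroups("abc123def456", false, patterns));
    }
}
```

Running main prints [abc123, abc, 123, def456, def, 456], matching the example in the description.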




[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-11 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575739#comment-13575739
 ] 

Adrien Grand commented on LUCENE-4766:
--

bq. I just wonder if we really should restrict our TF to not fix offsets? Kind 
of an odd thing though. What should a tokenfilter like this do instead?

I think that for some examples, it makes sense not to fix offsets? In the case 
of the URL example ({{(https?://([a-zA-Z\-_0-9.]+))}}), I think it makes sense 
to highlight the whole URL (including the leading http(s)://) even if the query 
term is just {{www.mysite.com}}. On the other hand, it could be weird if the 
goal was to split a long CamelCase token (letsPartyLIKEits1999_dude), but maybe 
this should be done by a Tokenizer rather than a TokenFilter?

(No strong feeling here, I'd just like to see if we can find a way to commit 
this patch without having to grow our TokenFilter exclusion list.)
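
To make the URL case concrete, here is a rough sketch of a filter-like expansion that leaves offsets alone: every emitted token simply inherits the offsets of the incoming token. The Token class, the expand method and the sample offsets are assumptions made for illustration, not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InheritedOffsetsSketch {

    // Hypothetical token holder: a term plus character offsets into the source text.
    static final class Token {
        final String term;
        final int startOffset;
        final int endOffset;

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return term + "[" + startOffset + "," + endOffset + ")";
        }
    }

    static List<Token> expand(Token in, Pattern pattern) {
        List<Token> out = new ArrayList<>();
        Matcher m = pattern.matcher(in.term);
        while (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                String text = m.group(g);
                if (text != null && !text.isEmpty()) {
                    // Offsets are copied from the incoming token, not recomputed.
                    out.add(new Token(text, in.startOffset, in.endOffset));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Pattern url = Pattern.compile("(https?://([a-zA-Z\\-_0-9.]+))");
        // Assume the URL token starts at character 10 of some document.
        Token in = new Token("http://www.mysite.com/index", 10, 37);
        System.out.println(expand(in, url));
    }
}
```

With this approach a query for www.mysite.com would highlight the whole original token, which is the behaviour argued for above.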




[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575760#comment-13575760
 ] 

Robert Muir commented on LUCENE-4766:
-

{quote}
(No strong feeling here, I'd just like to see if we can find a way to commit 
this patch without having to grow our TokenFilter exclusion list.)
{quote}

I don't think tokenfilters should change offsets in general. This is not 
possible to do correctly. In general, if you are splitting and creating tokens 
like this, it's a tokenizer, not a tokenfilter. And only a tokenizer can set 
offsets, because it's the only one that has access to the charfilter correction 
data.

Besides, trying to be a token 'creator' over an incoming tokenstream graph is 
really hard to get right.

So I would prefer that this filter either become a tokenizer or not change 
offsets at all. Then we can probably get it committed without hassle.

We cannot allow this exclusion list to grow. It's not an exclusion list that 
says 'it's ok to add more broken filters'. It's a list of filters that will get 
deleted from Lucene soon unless someone fixes them, because we have to stop 
IndexWriter from writing invalid data into the term vectors here.

Also the test should call checkRandomData() :)
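
The point about charfilter correction data can be illustrated with a toy model. The sketch below is not Lucene code: StrippingFilter is a made-up stand-in for a char filter that deletes a marker string and remembers where each surviving character came from. In Lucene the tokenizer reaches the equivalent mapping through correctOffset(int), while a filter further down the chain only sees the already filtered terms.

```java
class OffsetCorrectionSketch {

    // Toy stand-in for a char filter: deletes every occurrence of `marker`
    // and records, for each index of the filtered text, the original index.
    static final class StrippingFilter {
        final String filtered;
        private final int[] originalIndex;

        StrippingFilter(String original, String marker) {
            if (marker.isEmpty()) {
                throw new IllegalArgumentException("marker must not be empty");
            }
            StringBuilder sb = new StringBuilder();
            int[] map = new int[original.length() + 1];
            int i = 0;
            while (i < original.length()) {
                if (original.startsWith(marker, i)) {
                    i += marker.length(); // dropped text: no filtered index maps here
                } else {
                    map[sb.length()] = i;
                    sb.append(original.charAt(i));
                    i++;
                }
            }
            map[sb.length()] = original.length(); // end-of-input offset
            this.filtered = sb.toString();
            this.originalIndex = map;
        }

        // Maps an offset in the filtered text back to the original text.
        int correctOffset(int filteredOffset) {
            return originalIndex[filteredOffset];
        }
    }

    public static void main(String[] args) {
        String original = "foo<b>bar";
        StrippingFilter f = new StrippingFilter(original, "<b>");
        int start = f.filtered.indexOf("bar");
        int end = start + "bar".length();
        // Offsets in the filtered text are 3..6; corrected they are 6..9.
        System.out.println(original.substring(f.correctOffset(start), f.correctOffset(end)));
    }
}
```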





[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-11 Thread Clinton Gormley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575774#comment-13575774
 ] 

Clinton Gormley commented on LUCENE-4766:
-

Is it OK for a tokenizer to create multiple tokens in the same positions but 
with different offsets? eg 

{code}
foobar -> [foobar,foo,bar] with positions [1,1,1] and startOffsets 
[0,0,3]?
{code}

Having a look at the wordDelimiter token filter, with preserveOriginal set to 
true, it increments the position for each new offset, eg: 

{code}
fooBar -> [fooBar,foo,Bar] with positions [1,1,2] and startOffsets 
[0,0,3].
{code}

I'm asking because I'm not sure exactly how positions and offsets get used 
elsewhere, and so what the correct behaviour is. From my naive understanding, 
the wordDelimiter filter can produce spurious results with phrase searches, eg 
a phrase search for "fooBar Bar" matching a document that contains only fooBar.
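
The phrase concern can be checked with a simplified model of exact phrase matching, in which a phrase matches when its terms occur at consecutive positions. This is only a sketch of the idea, not how Lucene implements phrase queries; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class PhrasePositionSketch {

    // Builds term -> positions from parallel arrays of terms and absolute positions.
    static Map<String, Set<Integer>> index(String[] terms, int[] positions) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            postings.computeIfAbsent(terms[i], k -> new TreeSet<>()).add(positions[i]);
        }
        return postings;
    }

    // An exact phrase matches when its terms occur at consecutive positions.
    static boolean phraseMatches(Map<String, Set<Integer>> postings, String... phrase) {
        if (phrase.length == 0) {
            return false;
        }
        Set<Integer> starts = postings.get(phrase[0]);
        if (starts == null) {
            return false;
        }
        for (int start : starts) {
            boolean ok = true;
            for (int i = 1; i < phrase.length && ok; i++) {
                Set<Integer> positions = postings.get(phrase[i]);
                ok = positions != null && positions.contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] terms = {"fooBar", "foo", "Bar"};
        // Positions as produced in the wordDelimiter example: [1,1,2].
        System.out.println(phraseMatches(index(terms, new int[] {1, 1, 2}), "fooBar", "Bar"));
        // All three tokens stacked at one position: [1,1,1].
        System.out.println(phraseMatches(index(terms, new int[] {1, 1, 1}), "fooBar", "Bar"));
    }
}
```

Under this model the [1,1,2] layout lets "fooBar Bar" match a document that only contains fooBar, while the [1,1,1] layout does not (but it also stops "foo Bar" from matching).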




[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575776#comment-13575776
 ] 

Robert Muir commented on LUCENE-4766:
-

The positions are used for searching, offsets for highlighting.

So you can (unfortunately) set the offsets to whatever you want; it won't affect 
searches. Instead it will only cause problems for highlighting. An example of 
this is: https://issues.apache.org/jira/browse/SOLR-4137

For a tokenfilter, it doesn't make sense to change offsets, because a tokenizer 
already broke the document into words and mapped them back to their original 
location in the document.

If a tokenfilter REALLY needs to change offsets, then it's a sign it's 
subclassing the wrong analysis type and should be a tokenizer, because it's 
trying to break the document into words, not just alter existing tokenization :)
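
As a small illustration of why bad offsets only show up in highlighting: a highlighter essentially cuts the stored text at a token's start and end offsets. The helper below is a hypothetical stand-in for that step, and the sample text and offsets are made up.

```java
class HighlightSketch {

    // Wraps the character range [startOffset, endOffset) of the text in brackets.
    static String highlight(String text, int startOffset, int endOffset) {
        return text.substring(0, startOffset)
                + "[" + text.substring(startOffset, endOffset) + "]"
                + text.substring(endOffset);
    }

    public static void main(String[] args) {
        String text = "visit fooBar today";
        // Correct offsets for the sub-token "Bar" are 9..12.
        System.out.println(highlight(text, 9, 12));
        // The same term with wrong offsets 6..9 still matches at search time,
        // but the wrong characters get marked.
        System.out.println(highlight(text, 6, 9));
    }
}
```

The first call prints "visit foo[Bar] today" and the second "visit [foo]Bar today".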





[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-11 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575786#comment-13575786
 ] 

Adrien Grand commented on LUCENE-4766:
--

bq. Is it OK for a tokenizer to create multiple tokens in the same positions 
but with different offsets?

Although it's not common, it is perfectly fine for a Tokenizer to generate 
multiple tokens in the same position.

However, I think the correct way to tokenize your example would be:
{code}
tokens: foo, foobar, bar
positions: 1, 1, 2
position lengths: 1, 2, 1
start offsets: 0, 0, 3
end offsets: 3, 6, 6
{code}

I'm not sure WordDelimiterFilter is the best example to look at. I'm not 
familiar with it at all, but it's currently in the exclusion list for both 
positions and offsets (and is the culprit for SOLR-4137).
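
The tokenization above can be read as a small graph: each token is an arc that starts at its position and spans its position length. A token stream carries position increments rather than absolute positions, so the sketch below (a hypothetical helper, not Lucene API) derives the arcs from increments and lengths, counting positions from 1 as in the listing.

```java
import java.util.Arrays;

class PositionGraphSketch {

    // Returns one {from, to} pair per token, where `from` is the running sum of
    // position increments and `to` is `from` plus the position length.
    static int[][] arcs(int[] positionIncrements, int[] positionLengths) {
        int[][] arcs = new int[positionIncrements.length][2];
        int position = 0;
        for (int i = 0; i < positionIncrements.length; i++) {
            position += positionIncrements[i];
            arcs[i][0] = position;
            arcs[i][1] = position + positionLengths[i];
        }
        return arcs;
    }

    public static void main(String[] args) {
        // tokens: foo, foobar, bar
        int[] increments = {1, 0, 1};
        int[] lengths = {1, 2, 1};
        System.out.println(Arrays.deepToString(arcs(increments, lengths)));
    }
}
```

This prints [[1, 2], [1, 3], [2, 3]]: foobar spans positions 1 to 3, and so does the path foo followed by bar, which is what makes the two readings interchangeable for phrase matching.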




[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575793#comment-13575793
 ] 

Uwe Schindler commented on LUCENE-4766:
---

WDF is one of those bad examples. WDF should be included in a custom 
Tokenizer. WDF is always used together with WhitespaceTokenizer, so it should 
be included into WhitespaceTokenizer.




[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-11 Thread Clinton Gormley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575811#comment-13575811
 ] 

Clinton Gormley commented on LUCENE-4766:
-

OK, so I should redo this as a tokenizer, and set positionLengths correctly.

One issue is that, because there are multiple patterns, the emitted tokens can 
overlap, eg:

{code}
   foobarbaz -> foo, foobar, oba, bar, baz
{code}

in which case I think I would need to emit:

{code}
positions: 1, 1, 2, 3, 5
position lengths:  2, 4, 2, 2, 1
start offsets: 0, 0, 0, 0, 0
end offsets:   3, 6, 3, 3, 3
{code}

Is this correct? It's starting to look quite complex...
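
One way to see where the complexity comes from is to look at the raw character spans of those substrings in the input. The sketch below makes no claim about what a tokenizer should emit for positions or position lengths; it only computes the spans and checks which pairs partially overlap. All names are hypothetical.

```java
import java.util.Arrays;

class OverlapSketch {

    // Character span {start, end} of the first occurrence of `term` in `input`.
    static int[] span(String input, String term) {
        int start = input.indexOf(term);
        if (start < 0) {
            throw new IllegalArgumentException(term + " does not occur in " + input);
        }
        return new int[] {start, start + term.length()};
    }

    // True when b starts strictly inside a and ends strictly after it,
    // i.e. the spans overlap without one containing the other.
    static boolean crosses(int[] a, int[] b) {
        return a[0] < b[0] && b[0] < a[1] && a[1] < b[1];
    }

    public static void main(String[] args) {
        String input = "foobarbaz";
        for (String term : new String[] {"foo", "foobar", "oba", "bar", "baz"}) {
            System.out.println(term + " " + Arrays.toString(span(input, term)));
        }
        System.out.println(crosses(span(input, "foo"), span(input, "oba")));
    }
}
```

Here oba starts in the middle of foo and ends in the middle of bar, so it cannot be lined up on token boundaries defined by foo, bar and baz alone.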




[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-11 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575813#comment-13575813
 ] 

Simon Willnauer commented on LUCENE-4766:
-

I tend to disagree that this should be a tokenizer. IMO a tokenizer should only 
split tokens in a streaming fashion and should not really emit tokens at the 
same position. That is what token filters should do. It also clashes with 
reusability, since you can't really reuse tokenizers: you have to decide which 
one you want. At some point you really need to know what you are doing here. I 
don't have a definite answer, but IMO there is currently no clean way to do 
what Clinton wants to do.




[jira] [Commented] (LUCENE-4766) Pattern token filter which emits a token for every capturing group

2013-02-11 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575889#comment-13575889
 ] 

Shawn Heisey commented on LUCENE-4766:
--

I use WDF with ICUTokenizer.  ICUTokenizer is customized with an RBBI file for 
latin1 that only breaks on whitespace.
