[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-15 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006170#comment-15006170
 ] 

Steve Rowe commented on LUCENE-6874:


bq. Thank for the fruitful discussions! I hope Steve Rowe is not unhappy 
because we did not use jflex for this simple case and instead use Unicode data 
with the already existing CharTokenizer.

No worries, +1 to your work Uwe.  You've also laid the groundwork for future 
simple Unicode property-based char tokenizers, which is nice.  Thanks!

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: Trunk, 5.4
>
> Attachments: LUCENE-6874-chartokenizer.patch, 
> LUCENE-6874-chartokenizer.patch, LUCENE-6874-chartokenizer.patch, 
> LUCENE-6874-jflex.patch, LUCENE-6874.patch, LUCENE_6874_jflex.patch, 
> icu-datasucker.patch, unicode-ws-tokenizer.patch, unicode-ws-tokenizer.patch, 
> unicode-ws-tokenizer.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005553#comment-15005553
 ] 

ASF subversion and git services commented on LUCENE-6874:
-

Commit 1714354 from [~thetaphi] in branch 'dev/trunk'
[ https://svn.apache.org/r1714354 ]

LUCENE-6874: Add a new UnicodeWhitespaceTokenizer to analysis/common that uses 
Unicode character properties extracted from ICU4J to tokenize text on whitespace

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Assignee: Uwe Schindler
>Priority: Minor
> Attachments: LUCENE-6874-chartokenizer.patch, 
> LUCENE-6874-chartokenizer.patch, LUCENE-6874-chartokenizer.patch, 
> LUCENE-6874-jflex.patch, LUCENE-6874.patch, LUCENE_6874_jflex.patch, 
> icu-datasucker.patch, unicode-ws-tokenizer.patch, unicode-ws-tokenizer.patch, 
> unicode-ws-tokenizer.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005563#comment-15005563
 ] 

ASF subversion and git services commented on LUCENE-6874:
-

Commit 1714355 from [~thetaphi] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1714355 ]

Merged revision(s) 1714354 from lucene/dev/trunk:
LUCENE-6874: Add a new UnicodeWhitespaceTokenizer to analysis/common that uses 
Unicode character properties extracted from ICU4J to tokenize text on whitespace

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Assignee: Uwe Schindler
>Priority: Minor
> Attachments: LUCENE-6874-chartokenizer.patch, 
> LUCENE-6874-chartokenizer.patch, LUCENE-6874-chartokenizer.patch, 
> LUCENE-6874-jflex.patch, LUCENE-6874.patch, LUCENE_6874_jflex.patch, 
> icu-datasucker.patch, unicode-ws-tokenizer.patch, unicode-ws-tokenizer.patch, 
> unicode-ws-tokenizer.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004794#comment-15004794
 ] 

Uwe Schindler commented on LUCENE-6874:
---

If nobody objects, I will commit this tomorrow.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-chartokenizer.patch, 
> LUCENE-6874-chartokenizer.patch, LUCENE-6874-chartokenizer.patch, 
> LUCENE-6874-jflex.patch, LUCENE-6874.patch, LUCENE_6874_jflex.patch, 
> icu-datasucker.patch, unicode-ws-tokenizer.patch, unicode-ws-tokenizer.patch, 
> unicode-ws-tokenizer.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002248#comment-15002248
 ] 

Uwe Schindler commented on LUCENE-6874:
---

Here is the output of the reuters test:

{noformat}
> Report Sum By (any) Name and Round (28 about 33 out of 34)
Operationround   runCnt   
recsPerRunrec/s  elapsedSecavgUsedMemavgTotalMem
AnalyzerFactory(name:WhitespaceTokenizer,WhitespaceTokenizer(rule:java))
   010 0.000.00 9,569,344
124,256,256
AnalyzerFactory(name:UnicodeWhitespaceTokenizer,WhitespaceTokenizer(rule:unicode))
 -   0 -  -   1 -  -  -  - 0 -  -  - 0.00 -  -   0.00 -   9,569,344 -  
124,256,256
Rounds_5  01 
24493540   360,841.19   67.8816,566,472124,256,256
NewAnalyzer(WhitespaceTokenizer) -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
 -  - 0 -  -   1 -  -  -  - 0 -  -  - 0.00 -  -   0.00 -   9,569,344 -  
124,256,256
[Character.isWhitespace()] WhitespaceTokenizer  
   01  2449354   331,038.537.4022,121,256
124,256,256
Seq_2 -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 0 -  -   2 -  - 
2449354 - 344,131.22 -  -  14.23 -  22,121,256 -  118,489,088
NewAnalyzer(UnicodeWhitespaceTokenizer) 
   010 0.000.0022,121,256
112,721,920
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer -  -  -  -  -  -  -  
-  -   0 -  -   1 -  - 2449354 - 358,302.22 -  -   6.84 -  22,121,256 -  
112,721,920
NewAnalyzer(WhitespaceTokenizer)
  110 0.000.0012,138,024
112,721,920
[Character.isWhitespace()] WhitespaceTokenizer -  -  -  -  -  -  -  -  -  -  -  
-  -   1 -  -   1 -  - 2449354 - 366,724.66 -  -   6.68 -  22,374,536 -  
112,721,920
Seq_2  12  
2449354   365,139.25   13.4227,477,352117,702,656
NewAnalyzer(UnicodeWhitespaceTokenizer) -  -  -  -  -  -  -  -  -  -  -  -  -  
-  -  - 1 -  -   1 -  -  -  - 0 -  -  - 0.00 -  -   0.00 -  22,374,536 -  
111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer  
   11  2449354   363,567.476.7432,580,168
122,683,392
NewAnalyzer(WhitespaceTokenizer) -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
 -  - 2 -  -   1 -  -  -  - 0 -  -  - 0.00 -  -   0.00 -  32,580,168 -  
122,683,392
[Character.isWhitespace()] WhitespaceTokenizer  
   21  2449354   365,793.596.7033,461,280
122,683,392
Seq_2 -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 2 -  -   2 -  - 
2449354 - 365,112.03 -  -  13.42 -  33,461,280 -  117,178,368
NewAnalyzer(UnicodeWhitespaceTokenizer) 
   210 0.000.0033,461,280
111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer -  -  -  -  -  -  -  
-  -   2 -  -   1 -  - 2449354 - 364,432.97 -  -   6.72 -  33,461,280 -  
111,673,344
NewAnalyzer(WhitespaceTokenizer)
  310 0.000.0010,836,464
111,673,344
[Character.isWhitespace()] WhitespaceTokenizer -  -  -  -  -  -  -  -  -  -  -  
-  -   3 -  -   1 -  - 2449354 - 367,660.47 -  -   6.66 -  12,451,400 -  
111,673,344
Seq_2  32  
2449354   365,820.94   13.3913,235,672111,673,344
NewAnalyzer(UnicodeWhitespaceTokenizer) -  -  -  -  -  -  -  -  -  -  -  -  -  
-  -  - 3 -  -   1 -  -  -  - 0 -  -  - 0.00 -  -   0.00 -  12,451,400 -  
111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer  
   31  2449354   363,999.696.7314,019,944
111,673,344
NewAnalyzer(WhitespaceTokenizer) -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
 -  - 4 -  -   1 -  -  -  - 0 -  -  - 0.00 -  -   0.00 -  14,019,944 -  
111,673,344
[Character.isWhitespace()] WhitespaceTokenizer  
   41  2449354   367,329.626.6715,061,368
111,673,344
Seq_2 -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 4 -  -   2 -  - 
2449354 - 365,057.59 -  -  13.42 -  15,813,920 -  111,673,344
NewAnalyzer(UnicodeWhitespaceTokenizer) 
   410 0.000.0015,061,368
111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer -  -  -  -  -  -  -  
-  -   4 -  -   1 -  - 2449354 - 362,813.50 -  -   6.75 

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-12 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002388#comment-15002388
 ] 

David Smiley commented on LUCENE-6874:
--

+1 Patch is good Uwe.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-chartokenizer.patch, 
> LUCENE-6874-chartokenizer.patch, LUCENE-6874-chartokenizer.patch, 
> LUCENE-6874-jflex.patch, LUCENE-6874.patch, LUCENE_6874_jflex.patch, 
> icu-datasucker.patch, unicode-ws-tokenizer.patch, unicode-ws-tokenizer.patch, 
> unicode-ws-tokenizer.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001688#comment-15001688
 ] 

David Smiley commented on LUCENE-6874:
--

+1 I like it Uwe; nice job.  Automating the generation of the unicode code 
points is good, and may come in handy if we want other unicode rules.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch, icu-datasucker.patch, unicode-ws-tokenizer.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001338#comment-15001338
 ] 

Uwe Schindler commented on LUCENE-6874:
---

Result when running:

{noformat}
unicode-tokenizers:
   [groovy] Unicode version: 7.0.0.0
   [groovy] Whitespace: 9, 10, 11, 12, 13, 28, 29, 30, 31, 32, 5760, 8192, 
8193, 8194, 8195, 8196, 8197, 8198, 8200, 8201, 8202, 8232, 8233, 8287, 12288
{noformat}

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch, icu-datasucker.patch, icu-datasucker.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001345#comment-15001345
 ] 

Uwe Schindler commented on LUCENE-6874:
---

Sorry my fault, must be UCharacter.isUWhitespace(), result is then:

{noformat}
   [groovy] Unicode version: 7.0.0.0
   [groovy] Whitespace: 9, 10, 11, 12, 13, 32, 133, 160, 5760, 8192, 8193, 
8194, 8195, 8196, 8197, 8198, 8199, 8200, 8201, 8202, 8232, 8233, 8239, 8287, 
12288
{noformat}

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch, icu-datasucker.patch, icu-datasucker.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001346#comment-15001346
 ] 

Steve Rowe commented on LUCENE-6874:


Uwe, you're using UCharacter,isWhitespace(), but that's the same as the 
problematic Java Character.isWhitespace() -- note the exclusion of U+00A0 in 
your output Whitespace char list -- 
http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UCharacter.html says what 
you want is {{isUWhiteSpace(c)}} or {{hasBinaryProperty(c, 
UProperty.WHITE_SPACE)}}

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch, icu-datasucker.patch, icu-datasucker.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001349#comment-15001349
 ] 

Uwe Schindler commented on LUCENE-6874:
---

Sorry updated my post, recognized this a minute ago. :-)

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch, icu-datasucker.patch, icu-datasucker.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001000#comment-15001000
 ] 

Steve Rowe commented on LUCENE-6874:


bq. My idea was to use a Unicode data file and extract all Whitespace 
characters in a build tool. Shipping with the usicode data file would be a 
large overhead.

The JFlex project has a similar requirement, but for many more properties than 
just Whitespace.  JFlex includes a Maven plugin used by the build that parses 
Unicode data files via (you guessed it) JFlex scanners - here's the JFlex spec 
for the parser for binary property data files, including {{PropList.txt}}, 
which holds the Whitespace property definition: 
https://github.com/jflex-de/jflex/blob/master/jflex-unicode-maven-plugin/src/main/jflex/BinaryPropertiesFileScanner.flex
 

Note: Unicode property names can have aliases, and "loose" matching is the 
recommended way to refer to them (see 
http://unicode.org/reports/tr18/#Categories ): match case-insensitively, and 
ignore whitespace, dashes, and underscores.  {{PropList.txt}} gives the 
Whitespace property name as {{White_Space}}, and {{PropertyAliases.txt}} lists 
{{WSpace}} and {{space}} as aliases.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000940#comment-15000940
 ] 

Uwe Schindler commented on LUCENE-6874:
---

bq. Why persist the bitset and deal with the build issues around that when 
instead it could be done in-memory in a static initializer? It's so cheap to 
build that I question the effort in pre-building it as part of the build 
process.

Where to get the information for the bitset from? The Unicode data as of 
java.lang.Character class is wrong! :-) My idea was to use a Unicode data file 
and extract all Whitespace characters in a build tool. Shipping with the 
usicode data file would be a large overhead.

bq. On 5.x, Uwe, Adrien, how do you feel about the WhitespaceTokenizerFactory I 
have in the patch with the "rule" attribute to pick?

I am not happy, but could live with it.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001257#comment-15001257
 ] 

Uwe Schindler commented on LUCENE-6874:
---

Cool!

So my idea would be to write a small tool in the analysis/commons/src/tools 
module that uses the current jflex JAR file in classpath and extract the values 
for the bitset from the JFlex JAR file by accessing this class: 
https://github.com/jflex-de/jflex/blob/master/jflex/src/main/java/jflex/unicode/data/Unicode_6_3.java

In my opinion this would be more elegant than creating a huge jflex 
UnicodeWhitespaceTokenizer in a separate submodule, if we could just just a 
codegenerator that produces a BitSet to be used in a CharTokenizer subclass. 
CharTokenizer is thoroughly tested, so just feeding it with a bitset would be 
my preference.

Would this work?

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001312#comment-15001312
 ] 

Steve Rowe commented on LUCENE-6874:


bq. My idea was to create the whitespace chars as int[] array for every unicode 
version

Each ICU4J release only has data for the (single) Unicode version it was built 
against, so it won't work for this purpose.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001283#comment-15001283
 ] 

Steve Rowe commented on LUCENE-6874:


bq. Would this work?

Yes, but I think ICU4J is more authoritative, and already effectively pins the 
Unicode version used in Lucene, so I'd recommend going with it instead of JFlex 
to extract characters with property X.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001292#comment-15001292
 ] 

Uwe Schindler commented on LUCENE-6874:
---

My idea was to create the whitespace chars as int[] array for every unicode 
version and allow WhitespaceTokenizer to specify the version. As we have the 
data files already, the same groovy code against jflex data (or ICU4J) could be 
used to build LetterTokenizer, too?

This would also allw backwards compatibility. By default WhitespaceTokenizer 
would use latest Unicode version, unless you specify a version (as enum 
constant).

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001307#comment-15001307
 ] 

Uwe Schindler commented on LUCENE-6874:
---

...hacking Groovy script using ICU4J as specified in ivy-versions.properties...

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-10 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998838#comment-14998838
 ] 

David Smiley commented on LUCENE-6874:
--

Uwe,
Why persist the bitset and deal with the build issues around that when instead 
it could be done in-memory in a static initializer?  It's so cheap to build 
that I question the effort in pre-building it as part of the build process.

On 5.x, Uwe, Adrien, how do you feel about the WhitespaceTokenizerFactory I 
have in the patch with the "rule" attribute to pick?

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-09 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997057#comment-14997057
 ] 

David Smiley commented on LUCENE-6874:
--

Sorry, I really disagree with you on this.  I don't think this 
WhitespaceTokenizerFactory is hard to maintain at all.  It's true that it's 
harder only because it was a trivial factory before but so what?  Most 
importantly, I think it's a better user experience -- nobody should care what 
the specific Java Tokenizer implementation class will be coming out of the 
factory -- it's a tokenizer on whitespace using whatever definition/rule of 
whitespace they configured.  That could hypothetically be implemented using one 
Java Tokenizer implementing class or multiple but that's an implementation 
detail.

bq. Why is the ICUWhitespace being added?

I'll remove that in a new patch; I wasn't sure what to do but it's redundant so 
no need for it.



> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-09 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998105#comment-14998105
 ] 

Uwe Schindler commented on LUCENE-6874:
---

I would be fine to remove WhitespaceTokenizer in Lucene trunk. In that case I 
would also like to move the abstract CharTokenizer class out of 
oal.analysis.util to oal.analysis.core. This is not a big deal, but the util 
package is not fine for a first class citizen.

I also have another idea about this issue; I would prefer to not have the large 
java code with jflex involved. Wouldn't it be possible to save the isWhitespace 
data of Unicode in a compressible Lucene bitset and save it to disk as resource 
file? We could then load the bitset (like deleted documents) from a resource 
file and wrap a simple {{CharTokenizer.fromSeparatorCharPredicate(c -> 
compressedBitset.get(c))}} on top? The bitset could be generated from Unicode 
data on "ant regenerate".

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-09 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997516#comment-14997516
 ] 

Uwe Schindler commented on LUCENE-6874:
---

Yeah remove it! LUCENE-6879 is enough to quickly define your own 
WhitespaceTokenizer with ICU, if you want.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-09 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997645#comment-14997645
 ] 

Adrien Grand commented on LUCENE-6874:
--

I tend to like Uwe's idea. I have often wondered what the actual use-cases of 
WhitespaceTokenizer were but did not suggest to remove it as the cost of 
maintenance was very low given its simplicity. However now that there is some 
controversy arising and given how simple it is to create character-based 
tokenizers in trunk {{Tokenizer tok = 
CharTokenizer.fromSeparatorCharPredicate(Character::isWhitespace);}}, maybe we 
should just remove this tokenizer and let users define it themselves with the 
more flexible {{CharTokenizer.fromSeparatorCharPredicate}}?

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-09 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1499#comment-1499
 ] 

David Smiley commented on LUCENE-6874:
--

Just for clarification, Adrien, are you suggesting that 
WhitespaceTokenizerFactory stays, but WhitespaceTokenizer get removed (because 
it's easy to define one)?  I'm +1 to that.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14995169#comment-14995169
 ] 

Robert Muir commented on LUCENE-6874:
-

The shared factory is confusing: this is supposed to be a simple thing. Now we 
have some parameters that depend on other parameters and so on. Please, make 
two factories. We have to be able to maintain this stuff.

Why is the ICUWhitespace being added? 


> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987423#comment-14987423
 ] 

Steve Rowe commented on LUCENE-6874:


bq. I just noticed that your patch contains hardcoded filenames of your local 
system, I think those are just leftovers from your testing with reuters.

Yup, patch needs cleanup before it can be committed.  I figured the decision 
about what to do hasn't been made yet, so I'll wait on doing that work until 
then.

bq. Otherwise I am not happy about the size of the generated files, but thats 
how jflex works...

I don't think the generated Java source makes much difference - the thing 
people will deal with is the JFlex source, and it's fairly compact.  I looked 
at the .class file sizes on my system, and I see 13k for the JFlex version and 
2k for the ICU version. 

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987455#comment-14987455
 ] 

David Smiley commented on LUCENE-6874:
--

Nice thorough job Steve!

I propose that we consolidate the TokenizerFactories here into one -- the 
existing WhitespaceTokenizerFactory.  I think this is more user friendly.  An 
attribute could select *which* whitespace definition the user wants:  "java" or 
"unicode".  What do you think?

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988290#comment-14988290
 ] 

Steve Rowe commented on LUCENE-6874:


bq. I propose that we consolidate the TokenizerFactories here into one – the 
existing WhitespaceTokenizerFactory. I think this is more user friendly. An 
attribute could select which whitespace definition the user wants: "java" or 
"unicode". What do you think?

Implicitly then, you're nixing ICUWhitespaceTokenizer, since it can't be in 
analyzers-common.

I'm okay with adding a param to WhitespaceTokenizerFactory (not sure what to 
name it though: "authority"/"style"/"definition"?) .  Since the default 
wouldn't change ("java" would be the default I assume), I don't think we need 
to introduce luceneMatchVersion. 

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988404#comment-14988404
 ] 

Steve Rowe commented on LUCENE-6874:


bq. My concern for Solr users is that NBSP occurs somewhat commonly in HTML web 
pages - as a formatting technique more than an attempt at influencing 
tokenization.

FYI, {{}} is converted to U+0020 by {{HTMLStripCharFilter}}.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988338#comment-14988338
 ] 

Jack Krupansky commented on LUCENE-6874:


Certainly Solr can update its example schemas to use whatever alternative 
tokenizer or option is decided on so that Solr users, many of whom are not Java 
developers, will no longer fall into this NBSP trap, but... that still feels 
like a less than desirable resolution.

[~thetaphi], could you elaborate more specifically on the existing use case 
that you are trying to preserve? I mean, like in terms of a real-world example. 
Where do some of your NBSPs actually live in the wild?

It seems to me that the vast majority of normal users would not be negatively 
impacted by having "white space" be defined using the Unicode model. I never 
objected to using the Java model, but that's because I had overlooked this 
nuance of NBSP. My concern for Solr users is that NBSP occurs somewhat commonly 
in HTML web pages - as a formatting technique more than an attempt at 
influencing tokenization.


> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988646#comment-14988646
 ] 

Jack Krupansky commented on LUCENE-6874:


bq. Because WST and WDF should really only be used as a last resort.

Absolutely agreed. From a Solr user perspective we really need a much simpler 
model for semi-standard tokens out of the box without the user having to 
scratch their heads and resorting to WST in the first (last) place. LOL - maybe 
if we could eliminate this need to resort to WST, we wouldn't have to fret as 
much about WST.

bq.  I generally suggest to my users to use ClassicTokenizer

Personally, I've always refrained from recommending CT since I thought ST was 
supposed to replace it and that the email and URL support was considered an 
excess not worth keeping. I've considered CT as if it were deprecated (which it 
is not.) And, I never see anybody else recommending it on the user list. And, 
the fact that it can't handle slashes for product number is a deal killer. I'm 
not sure that I would argue in favor of resurrecting CT as a first-class 
recommendation, especially since it can't handle non-European languages, but...

That said, I do think it is worth separately (from this Jira) considering a 
fresh, new tokenizer that starts with the goodness of ST and adds in an 
approximation of the reasons that people resort to WST. Whether that can be an 
option on ST or has to be a separate tokenizer would need to be debated. I'd 
prefer an option on ST, either to simply allow embedded special characters or 
to specify a list or regex of special character to be allowed or excluded.

People would still need to combine NewT with WDF, but at least the tokenization 
would be more explicit.

Personally I would prefer to see an option for whether to retain or strip 
external punctuation vs. embedded special characters. Trailing periods and 
commas and columns and enclosing parentheses are just the kinds of things we 
had to resort to WDF for when using WST to retain embedded special characters.

And if people really want to be ambitious, a totally new tokenizer that 
subsumed the good parts of WDF would make a lot of lives of Solr users much 
easier.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988615#comment-14988615
 ] 

Jack Krupansky commented on LUCENE-6874:


Tika is the other (main?) approach to ingesting text from HTML web pages. I 
haven't checked exactly what it does on .

Maybe [~dsmiley] could elaborate on which use case he was encountering that 
inspired this Jira issue.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988805#comment-14988805
 ] 

Yonik Seeley commented on LUCENE-6874:
--

bq. I'd implement it as a bitset for characters < 65k

A single word can be useful for quickly ruling out whitespace and checking for 
common whitespace.  Example:
https://github.com/yonik/noggit/blob/master/src/main/java/org/noggit/JSONParser.java#L241

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988791#comment-14988791
 ] 

David Smiley commented on LUCENE-6874:
--

Jack:  My use-case since you asked:  I've got a document store of content in 
XML that provides various markup around mostly text.  These documents 
occasionally have an NBSP.  I process it outside of Solr to produce the text I 
want indexed/stored -- it's not XML any more.  An NBSP entity, if found, is 
converted to the NBSP character naturally as part of Java's XML libraries (no 
explicit decision on my part).

bq. Implicitly then, you're nixing ICUWhitespaceTokenizer, since it can't be in 
analyzers-common.

Right; ah well.

RE what to name the attribute:  I suggest "definition" or even better: "rule" 
(or "ruleset")

I do think the first-line sentence of these whitespace tokenizers should point 
to what definition of whitespace is chosen.  And that they reference each other 
so that anyone stumbling on them will know of the other.

RE WDF:  I prefer WhitespaceTokenizer with WDF for not just product-id data but 
also full-text.  Full-text might contain product-ids, or have things like 
"wi-fi" and many other words, like say "thread-safe" or "co-worker" that are 
sometimes hyphenated, sometimes not; some of these might be space-separated; 
etc..  WDF is very flexible but if you use a Tokenizer like Standard* or 
Classic* then hyphen will be pre-tokenized before WDF can do its thing, 
neutering part of its benefit.  I wish WDF kept payloads and other attributes; 
but it's not the only offender here, and likewise for the bigger issue of 
positionLength.  Otherwise I'm a WDF fan :-)  Nonetheless I like some of Jack's 
ideas on a better tokenizer that subsumes WDF.

BTW, FWIW if I had to write a WhitespaceTokenizer from scratch, I'd implement 
it as a bitset for characters < 65k (this is 8KB memory).  For the remainder 
I'd use an array that is scanned; but it appears there are none beyond 65k as I 
look at a table of these char's from a quick google search.  Then a 
configurable definition loader could fill named whitespace rules and it might 
be configurable to add or remove certain codes.  But no need to bother; Steve's 
impl is fine :-)

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987410#comment-14987410
 ] 

Uwe Schindler commented on LUCENE-6874:
---

Thanks Steve! I just noticed that your patch contains hardcoded filenames of 
your local system, I think those are just leftovers from your testing with 
reuters. Otherwise I am not happy about the size of the generated files, but 
thats how jflex works...

Sorry for forgetting to add the analyzer factory, I was just too fast in 
copypasting code yesterday :-) Thanks for adding.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985602#comment-14985602
 ] 

Uwe Schindler commented on LUCENE-6874:
---

bq. unicode whitespace is probably more useful and already well-defined

I would solve this issue by adding a ICUWhiteSpaceTokenizer using this 
definition to the ICU module. Problem solved (it is in fact a small patch with 
a Tokenizer extends CharTokenizer and the factory).

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985587#comment-14985587
 ] 

Robert Muir commented on LUCENE-6874:
-

I don't think we should make yet another definition of whitespace for java, 
there are already effectively 3 (Java isWhiteSpace(), java isSpaceChar(), and 
unicode whitespace): I think it would be better to just expose "unicode 
whitespace space" for situations like this.

java isSpaceChar is impractical, lets not even go there: it does not include 
controls such as tabs. so isSpaceChar('\t') == false and so on.

unicode whitespace is probably more useful and already well-defined:

http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UCharacter.html#isUWhiteSpace%28int%29
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:White_Space=Yes:]


> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985593#comment-14985593
 ] 

Uwe Schindler commented on LUCENE-6874:
---

bq. In short, the benefits to Solr users for NBSP being tokenized as white 
space seem to outweigh any minor use cases for treating it as non-white space. 
A compatibility mode can be provided if those minor use cases are considered 
truly worthwhile.

As said before. If we want to change this we need a new Tokenizer with new name 
and a new Factory. Please don't add new matchVersion constants for that because 
this is a huge break. The Tokenizer does what it should and what is documentes: 
This is not a bug.

And still this holds: Users should prefer StandardTokenizer, the wide usage of 
WhitespaceTokenizer is caused by tons of example configs from earlier Solr days 
that uses WhiteSpaceTokenizer together with broken WordDestroyerFilter. This is 
indeed only useful for product numbers, but not fulltext.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985598#comment-14985598
 ] 

David Smiley commented on LUCENE-6874:
--

bq. So maybe we should solve this problem by adding some documentation?

If the vast majority (like 90%+) of users that currently use 
WhitespaceTokenizer would want to tokenize on it, then I don't think 
documentation is sufficient at all.  Documenting something most people would 
want to change is very very easy to overlook.  That's what I call a _trap_; not 
that there might be some uses for the current behavior.  Lucene should do what 
most users want it do do by default.  As Jack said, the users of the search 
platform don't care what Java's definition of Character.isWhitespace is.

I propose WhitespaceTokenizerFactory have a flag for this, and that it default 
to consider NBSP a space based on Lucene's Version.

I get Uwe's point that there are other Tokenizers.  But I disagree that 
WhitespaceTokenizer shouldn't be used for "classical full text".  For example 
StandardTokenizer tokenizes on hypthen and thus foils some of the benefit of 
WordDelimiterFilter.  Maybe ICUTokenizer is an answer; I haven't checked it's 
interaction with WDF.  But why can't we just have a tokenizer that just 
tokenizes simply on all whitespace?

I'll have to see the links Rob just posted; I haven't read them yet.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985611#comment-14985611
 ] 

Uwe Schindler commented on LUCENE-6874:
---

[~rcmuir]: I am already preparing a patch :-)

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985605#comment-14985605
 ] 

Robert Muir commented on LUCENE-6874:
-

You can add a CharTokenizer to ICU analysis module that just looks like this:

{code}
  protected boolean isTokenChar(int c) {
return !UCharacter.isUWhiteSpace(c);
  }
{code}

If you are not happy with it needing ICU library, the definition of this 
property in ICU is "Space characters+TAB+CR+LF-ZWSP-ZWNBSP" 
(http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UProperty.html#WHITE_SPACE)
 so it shouldn't be hard to implement with just the jdk, but I do not know 
about the efficiency of that.

Either way, I think it should just be a different tokenizer.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985704#comment-14985704
 ] 

David Smiley commented on LUCENE-6874:
--

Uwe,
I beg to differ on WDF but I think we can put that behind us.  It's great to 
see a solution come together on using ICU's rules :-)

For something as trivial as detecting if the character is in a set, do we 
really need to depend on ICU?  It would be so nice to not need it, even if it's 
internal implementation seems fast.  Then we could consider deprecating 
WhitespaceTokenizer since, after all, why would one use it when 
ICUWhitespaceTokenizer exists?  Anyone wanting atypical tokenization could 
easily subclass CharTokenizer or consider a MappingCharFilter.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985748#comment-14985748
 ] 

Robert Muir commented on LUCENE-6874:
-

I don't think we need to deprecate whitespacetokenizer, i think its intuitive 
for a java developer, and lucene is a java API.

If you really must avoid ICU: I already mentioned, you can probably implement 
it yourself, but i just don't think it will be as fast. you will probably have 
to precompute and cache for latin1 and all kinds of stuff to make it 
competitive. and it will be messier and so on.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985749#comment-14985749
 ] 

Uwe Schindler commented on LUCENE-6874:
---

bq. Then we could consider deprecating WhitespaceTokenizer since, after all, 
why would one use it when ICUWhitespaceTokenizer exists?

Because the non-breaking space is useful for stuff (as explained above) where 
you want to keep tokens together (although usicode standard speaks about line 
wrapping, but in any case, like soft hyphen vs. hyphen, its just a matter what 
you want to do: the NBSP just tells the tokenizer or line-breaker or how you 
call it to keep tokens together). The problem are people that misuse {{}} 
in their HTML shit (e.g. tables). But for stuff I had implemented very often, I 
used WhitepsaceTokenizer to split tokens and placed a non-breaking space to 
keep tokens together.

So there is no need to deprecate WhitespaceTokenizer. It does what it should 
do. ICUWhitespaceTokenizer is using same naming and does the same, just with 
different rules.

bq. ... or consider a MappingCharFilter

This is thing is slow like hell. If you want it faster, e.g. use 
PatternTokenizer.

bq. It would be so nice to not need it, even if it's internal implementation 
seems fast

The problem is that you need to do additional 4-way branching: You have to 
check in {{isTokenChar()}} that it is {{!Whitespace()}} and also exclude all 
those 3 chars we listed in the description: {{'\u00A0', '\u2007', '\u202F'}}.

I agree with Robert: We should not change the default WhitespaceTokenizer and 
also not deprecate it. We should add a new one, which I did in supplied patch. 
If we want it in core, lets call it different and implement isTokenChar in a 
fast way without 3 additional branches.

bq. I beg to differ on WDF

This is coming from the fact that Solr is often misused because users just give 
up to think about tokenization. WDF only makes sense in product catalogues, but 
it is definitely broken for fulltext. The product catalogues are of course some 
of our customers, but before I suggest to them that they should use 
WhiteSpaceTokenizer with WordDestroyerFilter, I would analyzer their root 
problem (why is their tokenization broken). This is why I am against the broken 
example configs in Solr we had in the past. Because WST and WDF should really 
only be used as a last resort.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985799#comment-14985799
 ] 

Uwe Schindler commented on LUCENE-6874:
---

For those people that tend to think that StandardTokenizer is too aggressive, I 
generally suggest to my users to use ClassicTokenizer if they know that they 
only have European languages in their index. This is one reason why I was 
always against removing/deprecating ClassicTokenizer! It does a really good job 
for plain european text and also keeps most product numbers together (because 
it has quite good heuristics). It only breaks product numbers with slashes 
(e.g., "PS-10/20").

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985776#comment-14985776
 ] 

Steve Rowe commented on LUCENE-6874:


A JFlex version would be fast and simple and not require ICU to keep up with 
Unicode changes.  Not sure about needing to cache Latin-1 and other stuff to be 
competitive.  I'll give it a go later today.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985885#comment-14985885
 ] 

Uwe Schindler commented on LUCENE-6874:
---

One thing to make it full flexible in Lucene Trunk (Java 8 only). I know this 
would not help Solr users that want to define the Tokenizer in a config file, 
but for real Lucene users the Java 8-like way would be the following static 
method on CharTokenizer:

{code:java}
public static CharTokenizer fromPredicate(java.util.function.IntPredicate 
predicate)
{code}

This would allow to define a new CharTokenizer with a single line statement 
using any predicate:

{code:java}
// long variant with lambda:
Tokenizer tok = CharTokenizer.fromPredicate(c -> !UCharacter.isUWhiteSpace(c));

// method reference:
Tokenizer tok = CharTokenizer.fromPredicate( 
(UCharacter::isUWhiteSpace).negate() );

// method reference to custom function:
private boolean myTestFunction(int c) {
 return (cracy condition);
}

Tokenizer tok = CharTokenizer.fromPredicate(c -> this::myTestFunction);
{code}

I think we should do this in a separate issue in Lucene trunk for Java 8. This 
is really the way for which Java 8 Lambdas are made for. And its fast like 
hell, because its compiled to native bytecode so there is no call overhead.

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984873#comment-14984873
 ] 

Dawid Weiss commented on LUCENE-6874:
-

Depends what you consider a trap. 

A non-breakable whitespace could be a legitimate way to prevent two tokens from 
being separated if they need to be tokenized together. An example that comes to 
my mind is the special "zero-width" space or the hyphenation marker... which 
even on its own poses a problem [1]...

Ultimately it should be probably the question of whether we want to tokenize on 
"whitespace as in formatted text" or "whitespace as in logical codepoint units" 
and it doesn't apply to the WhitespaceTokenizer only, but to any tokenizer in 
general?

bq. I think WhitespaceTokenizer should tokenize on this.

Seems like majority of people would want it to be tokenized, I agree. But if 
you change this then there is no way to go back to previous behavior. Currently 
it's relatively easy to wrap your input in a reader that replaces those 
problematic codepoints on the fly before they're fed to the tokenizer?

[1] https://www.cs.tut.fi/~jkorpela/shy.html

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985540#comment-14985540
 ] 

Jack Krupansky commented on LUCENE-6874:


+1 for using the Unicode definition of white space rather than the (odd) Java 
definition. From a Solr user perspective, the fact that Java is used for 
implementation under the hood should be irrelevant. That said, the Javadoc for 
WhitespaceTokenizer#isTokenChar does explicitly refer to isWhitespace already.

The term "non-breaking white space" explicitly refers to line breaking and has 
no mention of tokens in either Unicode or traditional casual usage.

>From a Solr user perspective, there is like zero value to having NBSP from 
>HTML web pages being treated as if it were not traditional white space.

>From a Solr user perspective, the primary use of whitespace tokenizer is to 
>avoid the fact that standard tokenizer breaks on various special characters 
>such as occur in product numbers.

In short, the benefits to Solr users for NBSP being tokenized as white space 
seem to outweigh any minor use cases for treating it as non-white space. A 
compatibility mode can be provided if those minor use cases are considered 
truly worthwhile.


> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984940#comment-14984940
 ] 

Dawid Weiss commented on LUCENE-6874:
-

Any improvement to the docs that clarify what the software does would be great 
:)

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984896#comment-14984896
 ] 

Adrien Grand commented on LUCENE-6874:
--

So maybe we should solve this problem by adding some documention?

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986193#comment-14986193
 ] 

Uwe Schindler commented on LUCENE-6874:
---

I opened LUCENE-6879 for the idea (which is related to this issue as it 
provides a simple and general way suitable for Java 8).

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985076#comment-14985076
 ] 

Uwe Schindler commented on LUCENE-6874:
---

My personal opinion on this:
- The thing is called WhitespaceTokenizer, so it should do what the name says 
(split on isWhitespace).
- If we want something else, maybe provide a separate CharTokenizer 
implementation that also splits on NBSP

In general, whitespace tokenizer is not used for "classical" fulltext. For this 
type of text one would better use StandardTokenizer, ICU's Tokenizers or the 
language specific ones for Chinese or Japan. People using WhitespaceTokenizer 
are more those people which have very special types of fields, like a list of 
whitespace-separated tokens used for facetting or stuff like a list of product 
numbers. These types of tokens were always good to handle with 
WhitespaceTokenizer. If you wanted to keep your facet tokens together, you were 
able to use NBSP! So a change here would be a break for those apps :-)

So I would just update documentation to explain what this thing does (splitting 
on whitespace and not on spaces in general).

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Smiley
>Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org