[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006170#comment-15006170 ]

Steve Rowe commented on LUCENE-6874:

bq. Thank for the fruitful discussions! I hope Steve Rowe is not unhappy because we did not use jflex for this simple case and instead use Unicode data with the already existing CharTokenizer.

No worries, +1 to your work Uwe. You've also laid the groundwork for future simple Unicode property-based char tokenizers, which is nice.

Thanks!

> WhitespaceTokenizer should tokenize on NBSP
> ---
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: David Smiley
> Assignee: Uwe Schindler
> Priority: Minor
> Fix For: Trunk, 5.4
>
> Attachments: LUCENE-6874-chartokenizer.patch, LUCENE-6874-chartokenizer.patch, LUCENE-6874-chartokenizer.patch, LUCENE-6874-jflex.patch, LUCENE-6874.patch, LUCENE_6874_jflex.patch, icu-datasucker.patch, unicode-ws-tokenizer.patch, unicode-ws-tokenizer.patch, unicode-ws-tokenizer.patch
>
> WhitespaceTokenizer uses [Character.isWhitespace|http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-] to decide what is whitespace. Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this. I am aware it's easy to work around but why leave this trap in by default?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
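To make the trap from the issue description concrete, here is a minimal JDK-only sketch. It is not code from any of the attached patches, and the class and method names are made up for illustration. It splits text on `Character.isWhitespace`, as the description says WhitespaceTokenizer does, and shows that a no-break space (U+00A0) does not separate tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not Lucene code).
final class IsWhitespaceTrap {

    // Splits text on every code point for which Character.isWhitespace is true.
    static List<String> splitOnJavaWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                // A separator ends the current token, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "foo" and "bar" are joined by U+00A0, "bar" and "baz" by U+0020.
        System.out.println(splitOnJavaWhitespace("foo\u00A0bar baz").size());
    }
}
```

With this input the sketch yields two tokens rather than three, because the no-break space stays inside the first token.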
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005553#comment-15005553 ]

ASF subversion and git services commented on LUCENE-6874:

Commit 1714354 from [~thetaphi] in branch 'dev/trunk'
[ https://svn.apache.org/r1714354 ]

LUCENE-6874: Add a new UnicodeWhitespaceTokenizer to analysis/common that uses Unicode character properties extracted from ICU4J to tokenize text on whitespace
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005563#comment-15005563 ]

ASF subversion and git services commented on LUCENE-6874:

Commit 1714355 from [~thetaphi] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1714355 ]

Merged revision(s) 1714354 from lucene/dev/trunk:
LUCENE-6874: Add a new UnicodeWhitespaceTokenizer to analysis/common that uses Unicode character properties extracted from ICU4J to tokenize text on whitespace
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004794#comment-15004794 ]

Uwe Schindler commented on LUCENE-6874:

If nobody objects, I will commit this tomorrow.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002248#comment-15002248 ]

Uwe Schindler commented on LUCENE-6874:

Here is the output of the reuters test:

{noformat}
Report Sum By (any) Name and Round (28 about 33 out of 34)
Operation                                                                           round  runCnt  recsPerRun       rec/s  elapsedSec  avgUsedMem  avgTotalMem
AnalyzerFactory(name:WhitespaceTokenizer,WhitespaceTokenizer(rule:java))                0       1           0        0.00        0.00   9,569,344  124,256,256
AnalyzerFactory(name:UnicodeWhitespaceTokenizer,WhitespaceTokenizer(rule:unicode))      0       1           0        0.00        0.00   9,569,344  124,256,256
Rounds_5                                                                                0       1    24493540  360,841.19       67.88  16,566,472  124,256,256
NewAnalyzer(WhitespaceTokenizer)                                                        0       1           0        0.00        0.00   9,569,344  124,256,256
[Character.isWhitespace()] WhitespaceTokenizer                                          0       1     2449354  331,038.53        7.40  22,121,256  124,256,256
Seq_2                                                                                   0       2     2449354  344,131.22       14.23  22,121,256  118,489,088
NewAnalyzer(UnicodeWhitespaceTokenizer)                                                 0       1           0        0.00        0.00  22,121,256  112,721,920
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                              0       1     2449354  358,302.22        6.84  22,121,256  112,721,920
NewAnalyzer(WhitespaceTokenizer)                                                        1       1           0        0.00        0.00  12,138,024  112,721,920
[Character.isWhitespace()] WhitespaceTokenizer                                          1       1     2449354  366,724.66        6.68  22,374,536  112,721,920
Seq_2                                                                                   1       2     2449354  365,139.25       13.42  27,477,352  117,702,656
NewAnalyzer(UnicodeWhitespaceTokenizer)                                                 1       1           0        0.00        0.00  22,374,536  111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                              1       1     2449354  363,567.47        6.74  32,580,168  122,683,392
NewAnalyzer(WhitespaceTokenizer)                                                        2       1           0        0.00        0.00  32,580,168  122,683,392
[Character.isWhitespace()] WhitespaceTokenizer                                          2       1     2449354  365,793.59        6.70  33,461,280  122,683,392
Seq_2                                                                                   2       2     2449354  365,112.03       13.42  33,461,280  117,178,368
NewAnalyzer(UnicodeWhitespaceTokenizer)                                                 2       1           0        0.00        0.00  33,461,280  111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                              2       1     2449354  364,432.97        6.72  33,461,280  111,673,344
NewAnalyzer(WhitespaceTokenizer)                                                        3       1           0        0.00        0.00  10,836,464  111,673,344
[Character.isWhitespace()] WhitespaceTokenizer                                          3       1     2449354  367,660.47        6.66  12,451,400  111,673,344
Seq_2                                                                                   3       2     2449354  365,820.94       13.39  13,235,672  111,673,344
NewAnalyzer(UnicodeWhitespaceTokenizer)                                                 3       1           0        0.00        0.00  12,451,400  111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                              3       1     2449354  363,999.69        6.73  14,019,944  111,673,344
NewAnalyzer(WhitespaceTokenizer)                                                        4       1           0        0.00        0.00  14,019,944  111,673,344
[Character.isWhitespace()] WhitespaceTokenizer                                          4       1     2449354  367,329.62        6.67  15,061,368  111,673,344
Seq_2                                                                                   4       2     2449354  365,057.59       13.42  15,813,920  111,673,344
NewAnalyzer(UnicodeWhitespaceTokenizer)                                                 4       1           0        0.00        0.00  15,061,368  111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                              4       1     2449354  362,813.50        6.75
{noformat}
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002388#comment-15002388 ]

David Smiley commented on LUCENE-6874:

+1 Patch is good Uwe.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001688#comment-15001688 ]

David Smiley commented on LUCENE-6874:

+1 I like it Uwe; nice job. Automating the generation of the unicode code points is good, and may come in handy if we want other unicode rules.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001338#comment-15001338 ]

Uwe Schindler commented on LUCENE-6874:

Result when running:

{noformat}
unicode-tokenizers:
   [groovy] Unicode version: 7.0.0.0
   [groovy] Whitespace: 9, 10, 11, 12, 13, 28, 29, 30, 31, 32, 5760, 8192, 8193, 8194, 8195, 8196, 8197, 8198, 8200, 8201, 8202, 8232, 8233, 8287, 12288
{noformat}
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001345#comment-15001345 ]

Uwe Schindler commented on LUCENE-6874:

Sorry, my fault: it must be UCharacter.isUWhiteSpace(). The result is then:

{noformat}
   [groovy] Unicode version: 7.0.0.0
   [groovy] Whitespace: 9, 10, 11, 12, 13, 32, 133, 160, 5760, 8192, 8193, 8194, 8195, 8196, 8197, 8198, 8199, 8200, 8201, 8202, 8232, 8233, 8239, 8287, 12288
{noformat}
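The difference between the two runs is easier to see as a set difference. The sketch below copies both code point lists from the {noformat} output in this thread (the earlier run with UCharacter.isWhitespace() and the corrected run) and computes what each list has that the other lacks. The class and method names are hypothetical.

```java
import java.util.TreeSet;

// Hypothetical helper, for illustration only.
final class WhitespaceListDiff {

    // Code points reported for UCharacter.isWhitespace() (Unicode 7.0.0.0).
    static final int[] IS_WHITESPACE = {
        9, 10, 11, 12, 13, 28, 29, 30, 31, 32, 5760, 8192, 8193, 8194, 8195,
        8196, 8197, 8198, 8200, 8201, 8202, 8232, 8233, 8287, 12288
    };

    // Code points reported for the corrected run (Unicode 7.0.0.0).
    static final int[] IS_U_WHITE_SPACE = {
        9, 10, 11, 12, 13, 32, 133, 160, 5760, 8192, 8193, 8194, 8195, 8196,
        8197, 8198, 8199, 8200, 8201, 8202, 8232, 8233, 8239, 8287, 12288
    };

    // Returns the sorted code points that are in a but not in b.
    static TreeSet<Integer> minus(int[] a, int[] b) {
        TreeSet<Integer> result = new TreeSet<>();
        for (int cp : a) {
            result.add(cp);
        }
        for (int cp : b) {
            result.remove(cp);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("only in first run:  " + minus(IS_WHITESPACE, IS_U_WHITE_SPACE));
        System.out.println("only in second run: " + minus(IS_U_WHITE_SPACE, IS_WHITESPACE));
    }
}
```

The second run adds 133 (U+0085), 160 (U+00A0), 8199 (U+2007) and 8239 (U+202F), which includes the three non-breaking spaces named in the issue description, and drops the control characters 28 to 31.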
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001346#comment-15001346 ]

Steve Rowe commented on LUCENE-6874:

Uwe, you're using UCharacter.isWhitespace(), but that's the same as the problematic Java Character.isWhitespace(); note the exclusion of U+00A0 in your output Whitespace char list. http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UCharacter.html says what you want is {{isUWhiteSpace(c)}} or {{hasBinaryProperty(c, UProperty.WHITE_SPACE)}}
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001349#comment-15001349 ]

Uwe Schindler commented on LUCENE-6874:

Sorry, I updated my post; I noticed this a minute ago. :-)
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001000#comment-15001000 ]

Steve Rowe commented on LUCENE-6874:

bq. My idea was to use a Unicode data file and extract all Whitespace characters in a build tool. Shipping with the Unicode data file would be a large overhead.

The JFlex project has a similar requirement, but for many more properties than just Whitespace. JFlex includes a Maven plugin used by the build that parses Unicode data files via (you guessed it) JFlex scanners. Here's the JFlex spec for the parser for binary property data files, including {{PropList.txt}}, which holds the Whitespace property definition: https://github.com/jflex-de/jflex/blob/master/jflex-unicode-maven-plugin/src/main/jflex/BinaryPropertiesFileScanner.flex

Note: Unicode property names can have aliases, and "loose" matching is the recommended way to refer to them (see http://unicode.org/reports/tr18/#Categories ): match case-insensitively, and ignore whitespace, dashes, and underscores. {{PropList.txt}} gives the Whitespace property name as {{White_Space}}, and {{PropertyAliases.txt}} lists {{WSpace}} and {{space}} as aliases.
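The loose matching rule described above can be sketched in a few lines of Java. This is only an illustration of the rule as stated in this comment (ignore case, whitespace, dashes and underscores), not the JFlex implementation; the class name, the method names, and the small hard-coded alias set are assumptions made for the example. The alias set holds just the three names mentioned here: {{White_Space}}, {{WSpace}} and {{space}}.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, for illustration only (not JFlex code).
final class LoosePropertyMatch {

    // Normalized forms of the names this thread gives for the Whitespace property.
    private static final Set<String> WHITESPACE_NAMES =
        new HashSet<>(Arrays.asList("whitespace", "wspace", "space"));

    // Lower-cases the name and drops whitespace, dashes and underscores.
    static String normalize(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c == '-' || c == '_' || Character.isWhitespace(c)) {
                continue;
            }
            sb.append(c);
        }
        return sb.toString().toLowerCase(Locale.ROOT);
    }

    // True if the given name loosely matches White_Space or one of its listed aliases.
    static boolean isWhitespaceProperty(String name) {
        return WHITESPACE_NAMES.contains(normalize(name));
    }

    public static void main(String[] args) {
        System.out.println(isWhitespaceProperty("white-space"));
        System.out.println(isWhitespaceProperty("WSPACE"));
        System.out.println(isWhitespaceProperty("Alphabetic"));
    }
}
```

Normalizing both sides before comparing keeps the alias table small: each alias needs one entry, and spelling variants such as `WHITE SPACE` or `white-space` are handled by the normalizer.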
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000940#comment-15000940 ]

Uwe Schindler commented on LUCENE-6874:

bq. Why persist the bitset and deal with the build issues around that when instead it could be done in-memory in a static initializer? It's so cheap to build that I question the effort in pre-building it as part of the build process.

Where would the information for the bitset come from? The Unicode data in the java.lang.Character class is wrong! :-) My idea was to use a Unicode data file and extract all Whitespace characters in a build tool. Shipping with the Unicode data file would be a large overhead.

bq. On 5.x, Uwe, Adrien, how do you feel about the WhitespaceTokenizerFactory I have in the patch with the "rule" attribute to pick?

I am not happy, but could live with it.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001257#comment-15001257 ]

Uwe Schindler commented on LUCENE-6874:

Cool! So my idea would be to write a small tool in the analysis/common/src/tools module that uses the current jflex JAR file in the classpath and extracts the values for the bitset from the JFlex JAR file by accessing this class: https://github.com/jflex-de/jflex/blob/master/jflex/src/main/java/jflex/unicode/data/Unicode_6_3.java

In my opinion this would be more elegant than creating a huge jflex UnicodeWhitespaceTokenizer in a separate submodule, if we could just use a code generator that produces a BitSet to be used in a CharTokenizer subclass. CharTokenizer is thoroughly tested, so just feeding it with a bitset would be my preference.

Would this work?
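A rough idea of what such a code generator could look like, as a JDK-only sketch. Important assumption: instead of reading the JFlex or ICU4J data discussed in this thread, it queries the `\p{IsWhite_Space}` binary property of `java.util.regex.Pattern` as a stand-in data source, so the generated list depends on the Unicode version of the running JDK. The class name and the generated field name are hypothetical; this is not the tool that was committed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical generator, for illustration only.
final class WhitespaceArrayGenerator {

    // Assumption: the JDK regex White_Space property stands in for the
    // JFlex/ICU4J Unicode data mentioned in the thread.
    private static final Pattern WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}");

    // Returns Java source declaring an int[] of all code points with the property.
    static String generate(String fieldName) {
        StringBuilder src = new StringBuilder("static final int[] " + fieldName + " = {");
        Matcher matcher = WHITE_SPACE.matcher("");
        boolean first = true;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            // Test each code point on its own as a one-code-point string.
            matcher.reset(new String(Character.toChars(cp)));
            if (matcher.matches()) {
                if (!first) {
                    src.append(", ");
                }
                src.append(String.format("0x%04X", cp));
                first = false;
            }
        }
        return src.append("};").toString();
    }

    public static void main(String[] args) {
        System.out.println(generate("WHITESPACE"));
    }
}
```

The emitted declaration could then be pasted into, or written to, a source file that a tokenizer reads at class-load time, which keeps the Unicode data file itself out of the shipped artifact.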
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001312#comment-15001312 ]

Steve Rowe commented on LUCENE-6874:

bq. My idea was to create the whitespace chars as int[] array for every unicode version

Each ICU4J release only has data for the (single) Unicode version it was built against, so it won't work for this purpose.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001283#comment-15001283 ]

Steve Rowe commented on LUCENE-6874:

bq. Would this work?

Yes, but I think ICU4J is more authoritative, and already effectively pins the Unicode version used in Lucene, so I'd recommend going with it instead of JFlex to extract characters with property X.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001292#comment-15001292 ]

Uwe Schindler commented on LUCENE-6874:

My idea was to create the whitespace chars as an int[] array for every Unicode version and allow WhitespaceTokenizer to specify the version. As we have the data files already, the same groovy code against jflex data (or ICU4J) could be used to build LetterTokenizer, too? This would also allow backwards compatibility. By default WhitespaceTokenizer would use the latest Unicode version, unless you specify a version (as an enum constant).
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001307#comment-15001307 ] Uwe Schindler commented on LUCENE-6874: --- ...hacking Groovy script using ICU4J as specified in ivy-versions.properties...
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998838#comment-14998838 ] David Smiley commented on LUCENE-6874: -- Uwe, why persist the bitset and deal with the build issues around that when instead it could be done in-memory in a static initializer? It's so cheap to build that I question the effort in pre-building it as part of the build process. On 5.x, Uwe, Adrien, how do you feel about the WhitespaceTokenizerFactory I have in the patch, with the "rule" attribute to pick the whitespace definition?
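For illustration, building the table in a static initializer might look like the sketch below. The class name `WhitespaceTable` is hypothetical, and the definition used here (the JDK's `Character.isWhitespace` plus the space separators it excludes) is an assumption of this sketch; it follows whatever Unicode version the running JDK ships, so it is not pinned the way a generated resource would be.

```java
import java.util.BitSet;

/** Hypothetical holder: the table is built once, when the class is loaded. */
final class WhitespaceTable {
  private static final BitSet WS = new BitSet(Character.MAX_VALUE + 1);

  static {
    // Assumed definition for this sketch: Character.isWhitespace plus the
    // space separators it leaves out (the non-breaking spaces).
    for (int cp = 0; cp <= Character.MAX_VALUE; cp++) {
      if (Character.isWhitespace(cp) || Character.isSpaceChar(cp)) {
        WS.set(cp);
      }
    }
  }

  private WhitespaceTable() {}

  /** Lookup for BMP code points; anything outside the BMP is reported as non-whitespace. */
  static boolean isWhitespace(int cp) {
    return cp >= 0 && cp <= Character.MAX_VALUE && WS.get(cp);
  }
}
```

The loop visits 65,536 code points once per class load, which is the "cheap to build" part of the argument.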
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997057#comment-14997057 ] David Smiley commented on LUCENE-6874: -- Sorry, I really disagree with you on this. I don't think this WhitespaceTokenizerFactory is hard to maintain at all. It's true that it's harder only because it was a trivial factory before but so what? Most importantly, I think it's a better user experience -- nobody should care what the specific Java Tokenizer implementation class will be coming out of the factory -- it's a tokenizer on whitespace using whatever definition/rule of whitespace they configured. That could hypothetically be implemented using one Java Tokenizer implementing class or multiple but that's an implementation detail. bq. Why is the ICUWhitespace being added? I'll remove that in a new patch; I wasn't sure what to do but it's redundant so no need for it.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998105#comment-14998105 ] Uwe Schindler commented on LUCENE-6874: --- I would be fine with removing WhitespaceTokenizer in Lucene trunk. In that case I would also like to move the abstract CharTokenizer class out of oal.analysis.util to oal.analysis.core. This is not a big deal, but the util package is not the right place for a first-class citizen. I also have another idea about this issue; I would prefer not to have the large Java code with JFlex involved. Wouldn't it be possible to save the isWhitespace data of Unicode in a compressible Lucene bitset and save it to disk as a resource file? We could then load the bitset (like deleted documents) from a resource file and wrap a simple {{CharTokenizer.fromSeparatorCharPredicate(c -> compressedBitset.get(c))}} on top? The bitset could be generated from Unicode data on "ant regenerate".
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997516#comment-14997516 ] Uwe Schindler commented on LUCENE-6874: --- Yeah remove it! LUCENE-6879 is enough to quickly define your own WhitespaceTokenizer with ICU, if you want.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997645#comment-14997645 ] Adrien Grand commented on LUCENE-6874: -- I tend to like Uwe's idea. I have often wondered what the actual use-cases of WhitespaceTokenizer were but did not suggest removing it as the cost of maintenance was very low given its simplicity. However, now that there is some controversy arising and given how simple it is to create character-based tokenizers in trunk ({{Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(Character::isWhitespace);}}), maybe we should just remove this tokenizer and let users define it themselves with the more flexible {{CharTokenizer.fromSeparatorCharPredicate}}?
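The one-liner in the comment above needs Lucene on the classpath, but the behaviour under discussion can be reproduced with the JDK alone. In the sketch below, `SeparatorSplit.tokens` is a hypothetical stand-in for a separator-predicate tokenizer; it is not Lucene's CharTokenizer and ignores offsets, attributes and buffer limits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical helper: splits text wherever the predicate marks a separator. */
final class SeparatorSplit {
  private SeparatorSplit() {}

  static List<String> tokens(String text, IntPredicate isSeparator) {
    List<String> out = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      i += Character.charCount(cp); // step over surrogate pairs as one code point
      if (isSeparator.test(cp)) {
        if (current.length() > 0) { // a separator ends the current token
          out.add(current.toString());
          current.setLength(0);
        }
      } else {
        current.appendCodePoint(cp);
      }
    }
    if (current.length() > 0) {
      out.add(current.toString());
    }
    return out;
  }
}
```

With `Character::isWhitespace` as the predicate, the input `"foo\u00A0bar baz"` comes out as two tokens because the NBSP stays inside the first one. Passing `cp -> Character.isWhitespace(cp) || Character.isSpaceChar(cp)` instead yields three tokens.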
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1499#comment-1499 ] David Smiley commented on LUCENE-6874: -- Just for clarification, Adrien, are you suggesting that WhitespaceTokenizerFactory stays, but WhitespaceTokenizer gets removed (because it's easy to define one)? I'm +1 to that.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14995169#comment-14995169 ] Robert Muir commented on LUCENE-6874: - The shared factory is confusing: this is supposed to be a simple thing. Now we have some parameters that depend on other parameters and so on. Please, make two factories. We have to be able to maintain this stuff. Why is the ICUWhitespace being added?
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987423#comment-14987423 ] Steve Rowe commented on LUCENE-6874: bq. I just noticed that your patch contains hardcoded filenames of your local system, I think those are just leftovers from your testing with reuters. Yup, patch needs cleanup before it can be committed. I figured the decision about what to do hasn't been made yet, so I'll wait on doing that work until then. bq. Otherwise I am not happy about the size of the generated files, but thats how jflex works... I don't think the generated Java source makes much difference - the thing people will deal with is the JFlex source, and it's fairly compact. I looked at the .class file sizes on my system, and I see 13k for the JFlex version and 2k for the ICU version.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987455#comment-14987455 ] David Smiley commented on LUCENE-6874: -- Nice thorough job Steve! I propose that we consolidate the TokenizerFactories here into one -- the existing WhitespaceTokenizerFactory. I think this is more user friendly. An attribute could select *which* whitespace definition the user wants: "java" or "unicode". What do you think?
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988290#comment-14988290 ] Steve Rowe commented on LUCENE-6874: bq. I propose that we consolidate the TokenizerFactories here into one – the existing WhitespaceTokenizerFactory. I think this is more user friendly. An attribute could select which whitespace definition the user wants: "java" or "unicode". What do you think? Implicitly then, you're nixing ICUWhitespaceTokenizer, since it can't be in analyzers-common. I'm okay with adding a param to WhitespaceTokenizerFactory (not sure what to name it though: "authority"/"style"/"definition"?). Since the default wouldn't change ("java" would be the default I assume), I don't think we need to introduce luceneMatchVersion.
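A factory parameter of this kind boils down to picking a predicate by name. The sketch below uses only the JDK and does not touch Lucene's TokenizerFactory; the class `WhitespaceRules`, the method `fromArgs` and the parameter name `rule` are assumptions for illustration, and the "unicode" branch hard-codes the Unicode White_Space code points (copied by hand, as of Unicode 6.3).

```java
import java.util.Map;
import java.util.function.IntPredicate;

/** Hypothetical helper mirroring a factory argument that selects the whitespace definition. */
final class WhitespaceRules {
  private WhitespaceRules() {}

  /** Reads the assumed "rule" argument; "java" is the default so existing configs keep their behaviour. */
  static IntPredicate fromArgs(Map<String, String> args) {
    String rule = args.getOrDefault("rule", "java");
    switch (rule) {
      case "java":
        return Character::isWhitespace;
      case "unicode":
        return WhitespaceRules::isUnicodeWhitespace;
      default:
        throw new IllegalArgumentException(
            "rule must be \"java\" or \"unicode\", got: " + rule);
    }
  }

  /** Unicode White_Space property, hand-copied for this sketch. */
  static boolean isUnicodeWhitespace(int cp) {
    return (cp >= 0x0009 && cp <= 0x000D) || cp == 0x0020 || cp == 0x0085 || cp == 0x00A0
        || cp == 0x1680 || (cp >= 0x2000 && cp <= 0x200A) || cp == 0x2028 || cp == 0x2029
        || cp == 0x202F || cp == 0x205F || cp == 0x3000;
  }
}
```

Because an absent argument resolves to "java", nothing changes for existing users, which is the reason given above for not needing luceneMatchVersion.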
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988404#comment-14988404 ] Steve Rowe commented on LUCENE-6874: bq. My concern for Solr users is that NBSP occurs somewhat commonly in HTML web pages - as a formatting technique more than an attempt at influencing tokenization. FYI, {{&nbsp;}} is converted to U+0020 by {{HTMLStripCharFilter}}.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988338#comment-14988338 ] Jack Krupansky commented on LUCENE-6874: Certainly Solr can update its example schemas to use whatever alternative tokenizer or option is decided on so that Solr users, many of whom are not Java developers, will no longer fall into this NBSP trap, but... that still feels like a less than desirable resolution. [~thetaphi], could you elaborate more specifically on the existing use case that you are trying to preserve? I mean, like in terms of a real-world example. Where do some of your NBSPs actually live in the wild? It seems to me that the vast majority of normal users would not be negatively impacted by having "white space" be defined using the Unicode model. I never objected to using the Java model, but that's because I had overlooked this nuance of NBSP. My concern for Solr users is that NBSP occurs somewhat commonly in HTML web pages - as a formatting technique more than an attempt at influencing tokenization.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988646#comment-14988646 ] Jack Krupansky commented on LUCENE-6874:

bq. Because WST and WDF should really only be used as a last resort.

Absolutely agreed. From a Solr user perspective we really need a much simpler model for semi-standard tokens out of the box without the user having to scratch their heads and resorting to WST in the first (last) place. LOL - maybe if we could eliminate this need to resort to WST, we wouldn't have to fret as much about WST.

bq. I generally suggest to my users to use ClassicTokenizer

Personally, I've always refrained from recommending CT since I thought ST was supposed to replace it and that the email and URL support was considered an excess not worth keeping. I've considered CT as if it were deprecated (which it is not). And, I never see anybody else recommending it on the user list. And, the fact that it can't handle slashes for product numbers is a deal killer. I'm not sure that I would argue in favor of resurrecting CT as a first-class recommendation, especially since it can't handle non-European languages, but...

That said, I do think it is worth separately (from this Jira) considering a fresh, new tokenizer that starts with the goodness of ST and adds in an approximation of the reasons that people resort to WST. Whether that can be an option on ST or has to be a separate tokenizer would need to be debated. I'd prefer an option on ST, either to simply allow embedded special characters or to specify a list or regex of special characters to be allowed or excluded. People would still need to combine NewT with WDF, but at least the tokenization would be more explicit.

Personally I would prefer to see an option for whether to retain or strip external punctuation vs. embedded special characters. Trailing periods and commas and colons and enclosing parentheses are just the kinds of things we had to resort to WDF for when using WST to retain embedded special characters.

And if people really want to be ambitious, a totally new tokenizer that subsumed the good parts of WDF would make a lot of lives of Solr users much easier.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988615#comment-14988615 ] Jack Krupansky commented on LUCENE-6874: Tika is the other (main?) approach to ingesting text from HTML web pages. I haven't checked exactly what it does on {{&nbsp;}}. Maybe [~dsmiley] could elaborate on which use case he was encountering that inspired this Jira issue.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988805#comment-14988805 ] Yonik Seeley commented on LUCENE-6874: -- bq. I'd implement it as a bitset for characters < 65k A single word can be useful for quickly ruling out whitespace and checking for common whitespace. Example: https://github.com/yonik/noggit/blob/master/src/main/java/org/noggit/JSONParser.java#L241
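The single-word trick can be sketched as follows. This is written from scratch in the spirit of the comment above, not copied from the linked noggit source; the class name `FastWs` is hypothetical and the whitespace set is the Unicode White_Space property, hand-copied for this example.

```java
/** Hypothetical helper: one 64-bit mask answers the question for every code point below 64. */
final class FastWs {
  // One bit per ASCII whitespace code point: TAB, LF, VT, FF, CR and SPACE.
  private static final long WS_MASK =
      (1L << '\t') | (1L << '\n') | (1L << 0x0B) | (1L << '\f') | (1L << '\r') | (1L << ' ');

  private FastWs() {}

  static boolean isWhitespace(int cp) {
    if (cp >= 0 && cp < 64) {
      // Fast path: a shift and a mask, no table lookup. The range check matters
      // because Java only uses the low six bits of a long shift count.
      return ((WS_MASK >>> cp) & 1L) != 0;
    }
    return isRareWhitespace(cp);
  }

  /** Slow path for the remaining White_Space code points. */
  private static boolean isRareWhitespace(int cp) {
    return cp == 0x0085 || cp == 0x00A0 || cp == 0x1680 || (cp >= 0x2000 && cp <= 0x200A)
        || cp == 0x2028 || cp == 0x2029 || cp == 0x202F || cp == 0x205F || cp == 0x3000;
  }
}
```

Digits and most ASCII punctuation fall below 64, so they are ruled out on the fast path, and so are the common separators (space, tab, newline).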
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988791#comment-14988791 ] David Smiley commented on LUCENE-6874: --

Jack: My use-case since you asked: I've got a document store of content in XML that provides various markup around mostly text. These documents occasionally have an NBSP. I process it outside of Solr to produce the text I want indexed/stored -- it's not XML any more. An NBSP entity, if found, is converted to the NBSP character naturally as part of Java's XML libraries (no explicit decision on my part).

bq. Implicitly then, you're nixing ICUWhitespaceTokenizer, since it can't be in analyzers-common.

Right; ah well.

RE what to name the attribute: I suggest "definition" or even better: "rule" (or "ruleset"). I do think the first-line sentence of these whitespace tokenizers should point to what definition of whitespace is chosen. And that they reference each other so that anyone stumbling on them will know of the other.

RE WDF: I prefer WhitespaceTokenizer with WDF for not just product-id data but also full-text. Full-text might contain product-ids, or have things like "wi-fi" and many other words, like say "thread-safe" or "co-worker" that are sometimes hyphenated, sometimes not; some of these might be space-separated; etc. WDF is very flexible but if you use a Tokenizer like Standard* or Classic* then hyphen will be pre-tokenized before WDF can do its thing, neutering part of its benefit. I wish WDF kept payloads and other attributes; but it's not the only offender here, and likewise for the bigger issue of positionLength. Otherwise I'm a WDF fan :-) Nonetheless I like some of Jack's ideas on a better tokenizer that subsumes WDF.

BTW, FWIW if I had to write a WhitespaceTokenizer from scratch, I'd implement it as a bitset for characters < 65k (this is 8KB memory). For the remainder I'd use an array that is scanned; but it appears there are none beyond 65k as I look at a table of these chars from a quick google search. Then a configurable definition loader could fill named whitespace rules and it might be configurable to add or remove certain codes. But no need to bother; Steve's impl is fine :-)
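The 8KB figure follows from 65,536 bits / 64 bits per long = 1,024 longs = 8,192 bytes. A rough sketch of that layout, with a linearly scanned array for anything above the BMP, might look like this; the class `BmpBitset` and its methods are hypothetical names for illustration, and the caller supplies the code points (for example from a named rule).

```java
import java.util.Arrays;

/** Hypothetical two-tier set: a flat bitset for the BMP, a scanned array for the rest. */
final class BmpBitset {
  // (0xFFFF + 1) / 64 = 1,024 words, i.e. 8,192 bytes.
  private final long[] words = new long[(Character.MAX_VALUE + 1) >>> 6];
  // Code points above U+FFFF; expected to be empty for whitespace definitions.
  private final int[] supplementary;

  BmpBitset(int... codePoints) {
    supplementary =
        Arrays.stream(codePoints).filter(cp -> cp > Character.MAX_VALUE).toArray();
    for (int cp : codePoints) {
      if (cp >= 0 && cp <= Character.MAX_VALUE) {
        words[cp >>> 6] |= 1L << (cp & 63); // word index, then bit within the word
      }
    }
  }

  boolean contains(int cp) {
    if (cp >= 0 && cp <= Character.MAX_VALUE) {
      return (words[cp >>> 6] & (1L << (cp & 63))) != 0;
    }
    for (int s : supplementary) { // short linear scan for the rare cases
      if (s == cp) {
        return true;
      }
    }
    return false;
  }

  /** Size of the BMP table in bytes. */
  int tableBytes() {
    return words.length * Long.BYTES;
  }
}
```

Adding or removing individual codes, as suggested above, would then amount to editing the list passed to the constructor.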
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987410#comment-14987410 ]

Uwe Schindler commented on LUCENE-6874:

Thanks Steve! I just noticed that your patch contains hardcoded filenames from your local system; I think those are just leftovers from your testing with Reuters. Otherwise I am not happy about the size of the generated files, but that's how JFlex works...

Sorry for forgetting to add the analyzer factory, I was just too fast in copy-pasting code yesterday :-) Thanks for adding it.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985602#comment-14985602 ]

Uwe Schindler commented on LUCENE-6874:

bq. unicode whitespace is probably more useful and already well-defined

I would solve this issue by adding an ICUWhiteSpaceTokenizer that uses this definition to the ICU module. Problem solved (it is in fact a small patch: a Tokenizer that extends CharTokenizer, plus the factory).
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985587#comment-14985587 ]

Robert Muir commented on LUCENE-6874:

I don't think we should make yet another definition of whitespace for Java; there are already effectively three (Java isWhitespace(), Java isSpaceChar(), and Unicode whitespace). I think it would be better to just expose "Unicode whitespace" for situations like this.

Java isSpaceChar is impractical, let's not even go there: it does not include controls such as tabs, so isSpaceChar('\t') == false and so on.

Unicode whitespace is probably more useful and already well-defined:
http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UCharacter.html#isUWhiteSpace%28int%29
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:White_Space=Yes:]
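To make the difference between the two JDK definitions concrete, here is a small sketch using only `java.lang.Character`. The class and method names are hypothetical wrappers added for illustration; the ICU definition is left out because it would need the ICU4J library on the classpath.

```java
/** Hypothetical demo of how the two JDK whitespace definitions disagree. */
final class WhitespaceDefinitions {
    /** The definition WhitespaceTokenizer relies on: excludes the non-breaking spaces. */
    static boolean javaWhitespace(int codePoint) {
        return Character.isWhitespace(codePoint);
    }

    /** Unicode categories Zs, Zl and Zp only: no control characters such as TAB. */
    static boolean javaSpaceChar(int codePoint) {
        return Character.isSpaceChar(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = {' ', '\t', 0x00A0, 0x2007, 0x202F};
        for (int c : samples) {
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b%n",
                c, javaWhitespace(c), javaSpaceChar(c));
        }
    }
}
```

TAB is whitespace only under `isWhitespace`, while the three non-breaking spaces are whitespace only under `isSpaceChar`; neither method alone covers both groups.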
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985593#comment-14985593 ]

Uwe Schindler commented on LUCENE-6874:

bq. In short, the benefits to Solr users for NBSP being tokenized as white space seem to outweigh any minor use cases for treating it as non-white space. A compatibility mode can be provided if those minor use cases are considered truly worthwhile.

As said before: if we want to change this, we need a new Tokenizer with a new name and a new Factory. Please don't add new matchVersion constants for that, because this is a huge break. The Tokenizer does what it should and what is documented: this is not a bug.

And this still holds: users should prefer StandardTokenizer. The wide usage of WhitespaceTokenizer is caused by tons of example configs from earlier Solr days that use WhitespaceTokenizer together with the broken WordDestroyerFilter. This is indeed only useful for product numbers, but not for fulltext.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985598#comment-14985598 ]

David Smiley commented on LUCENE-6874:

bq. So maybe we should solve this problem by adding some documentation?

If the vast majority (like 90%+) of users that currently use WhitespaceTokenizer would want to tokenize on it, then I don't think documentation is sufficient at all. Documentation of something most people would want to change is very, very easy to overlook. That's what I call a _trap_; I'm not claiming there are no uses for the current behavior. Lucene should do what most users want it to do by default. As Jack said, the users of the search platform don't care what Java's definition of Character.isWhitespace is. I propose that WhitespaceTokenizerFactory have a flag for this, and that it default to considering NBSP a space based on Lucene's Version.

I get Uwe's point that there are other Tokenizers. But I disagree that WhitespaceTokenizer shouldn't be used for "classical full text". For example, StandardTokenizer tokenizes on hyphen and thus foils some of the benefit of WordDelimiterFilter. Maybe ICUTokenizer is an answer; I haven't checked its interaction with WDF. But why can't we just have a tokenizer that simply tokenizes on all whitespace?

I'll have to see the links Rob just posted; I haven't read them yet.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985611#comment-14985611 ]

Uwe Schindler commented on LUCENE-6874:

[~rcmuir]: I am already preparing a patch :-)
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985605#comment-14985605 ]

Robert Muir commented on LUCENE-6874:

You can add a CharTokenizer to the ICU analysis module that just looks like this:

{code}
protected boolean isTokenChar(int c) {
  return !UCharacter.isUWhiteSpace(c);
}
{code}

If you are not happy with it needing the ICU library, the definition of this property in ICU is "Space characters+TAB+CR+LF-ZWSP-ZWNBSP" (http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UProperty.html#WHITE_SPACE), so it shouldn't be hard to implement with just the JDK, but I do not know about the efficiency of that.

Either way, I think it should just be a different tokenizer.
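A JDK-only approximation of that property might look like the sketch below. It is an assumption-laden illustration rather than the code that was eventually committed: the class name `UnicodeWhitespace` is hypothetical, the result depends on the Unicode version the running JDK ships with, and it has not been checked code point by code point against ICU's table.

```java
/** Hypothetical JDK-only approximation of Unicode White_Space (no ICU dependency). */
final class UnicodeWhitespace {
    static boolean isUnicodeWhitespace(int codePoint) {
        return Character.isSpaceChar(codePoint)               // Zs, Zl, Zp: includes U+00A0, U+2007, U+202F
            || (codePoint >= 0x0009 && codePoint <= 0x000D)   // TAB, LF, VT, FF, CR
            || codePoint == 0x0085;                           // NEL
    }

    /** Same shape as the isTokenChar() method shown in the comment above. */
    static boolean isTokenChar(int codePoint) {
        return !isUnicodeWhitespace(codePoint);
    }
}
```

ZWSP (U+200B) and ZWNBSP (U+FEFF) are format characters rather than space separators, so `Character.isSpaceChar` already leaves them out, which matches the "-ZWSP-ZWNBSP" part of the quoted definition.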
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985704#comment-14985704 ]

David Smiley commented on LUCENE-6874:

Uwe, I beg to differ on WDF, but I think we can put that behind us. It's great to see a solution come together on using ICU's rules :-)

For something as trivial as detecting whether a character is in a set, do we really need to depend on ICU? It would be so nice to not need it, even if its internal implementation seems fast. Then we could consider deprecating WhitespaceTokenizer since, after all, why would one use it when ICUWhitespaceTokenizer exists? Anyone wanting atypical tokenization could easily subclass CharTokenizer or consider a MappingCharFilter.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985748#comment-14985748 ]

Robert Muir commented on LUCENE-6874:

I don't think we need to deprecate WhitespaceTokenizer. I think it's intuitive for a Java developer, and Lucene is a Java API.

If you really must avoid ICU: as I already mentioned, you can probably implement it yourself, but I just don't think it will be as fast. You will probably have to precompute and cache for Latin-1 and all kinds of stuff to make it competitive, and it will be messier and so on.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985749#comment-14985749 ]

Uwe Schindler commented on LUCENE-6874:

bq. Then we could consider deprecating WhitespaceTokenizer since, after all, why would one use it when ICUWhitespaceTokenizer exists?

Because the non-breaking space is useful for stuff (as explained above) where you want to keep tokens together. The Unicode standard speaks about line wrapping, but in any case, as with soft hyphen vs. hyphen, it's just a matter of what you want to do: the NBSP just tells the tokenizer, or line-breaker, or whatever you call it, to keep tokens together. The problem is people who misuse {{&nbsp;}} in their HTML shit (e.g. tables). But for stuff I have implemented very often, I used WhitespaceTokenizer to split tokens and placed a non-breaking space to keep tokens together.

So there is no need to deprecate WhitespaceTokenizer. It does what it should do. ICUWhitespaceTokenizer uses the same naming and does the same, just with different rules.

bq. ... or consider a MappingCharFilter

This thing is slow like hell. If you want it faster, use e.g. PatternTokenizer.

bq. It would be so nice to not need it, even if it's internal implementation seems fast

The problem is that you need to do additional 4-way branching: in {{isTokenChar()}} you have to check {{!isWhitespace()}} and also treat as separators the 3 chars we listed in the description: {{'\u00A0', '\u2007', '\u202F'}}.

I agree with Robert: we should not change the default WhitespaceTokenizer and also not deprecate it. We should add a new one, which I did in the supplied patch. If we want it in core, let's call it something different and implement isTokenChar in a fast way without 3 additional branches.

bq. I beg to differ on WDF

This comes from the fact that Solr is often misused because users just give up thinking about tokenization. WDF only makes sense in product catalogues, but it is definitely broken for fulltext. Product catalogues are of course run by some of our customers, but before I suggest to them that they should use WhitespaceTokenizer with WordDestroyerFilter, I would analyze their root problem (why is their tokenization broken?). This is why I am against the broken example configs we had in Solr in the past: WST and WDF should really only be used as a last resort.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985799#comment-14985799 ]

Uwe Schindler commented on LUCENE-6874:

For those people who tend to think that StandardTokenizer is too aggressive, I generally suggest to my users that they use ClassicTokenizer if they know that they only have European languages in their index. This is one reason why I was always against removing/deprecating ClassicTokenizer! It does a really good job on plain European text and also keeps most product numbers together (because it has quite good heuristics). It only breaks product numbers with slashes (e.g., "PS-10/20").
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985776#comment-14985776 ]

Steve Rowe commented on LUCENE-6874:

A JFlex version would be fast and simple and not require ICU to keep up with Unicode changes. Not sure about needing to cache Latin-1 and other stuff to be competitive. I'll give it a go later today.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985885#comment-14985885 ]

Uwe Schindler commented on LUCENE-6874:

One thing to make it fully flexible in Lucene Trunk (Java 8 only). I know this would not help Solr users who want to define the Tokenizer in a config file, but for real Lucene users the Java 8-like way would be the following static method on CharTokenizer:

{code:java}
public static CharTokenizer fromPredicate(java.util.function.IntPredicate predicate)
{code}

This would allow defining a new CharTokenizer with a single-line statement using any predicate:

{code:java}
// long variant with lambda:
Tokenizer tok = CharTokenizer.fromPredicate(c -> !UCharacter.isUWhiteSpace(c));

// method reference (the cast gives it a target type so that negate() can be called):
Tokenizer tok = CharTokenizer.fromPredicate(((IntPredicate) UCharacter::isUWhiteSpace).negate());

// method reference to custom function:
private boolean myTestFunction(int c) {
  return (crazy condition);
}
Tokenizer tok = CharTokenizer.fromPredicate(this::myTestFunction);
{code}

I think we should do this in a separate issue in Lucene trunk for Java 8. This is really what Java 8 lambdas are made for. And it's fast like hell, because it's compiled to native bytecode, so there is no call overhead.
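The predicate-driven approach proposed in this comment can be tried out without Lucene on the classpath. The sketch below is a stand-in written for illustration only: `PredicateSplitter` and `tokenize` are hypothetical names, and the method returns plain strings instead of a Lucene `Tokenizer`, so it shows the idea of `fromPredicate` rather than its actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical stand-in for a predicate-configured CharTokenizer. */
final class PredicateSplitter {
    /** Splits text into maximal runs of code points accepted by isTokenChar. */
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);   // step over surrogate pairs correctly
            if (isTokenChar.test(codePoint)) {
                current.appendCodePoint(codePoint);
            } else if (current.length() > 0) {
                tokens.add(current.toString());    // a separator ends the current token
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With `IntPredicate javaWs = Character::isWhitespace;`, calling `PredicateSplitter.tokenize("a\u00A0b c", javaWs.negate())` keeps `a\u00A0b` together as one token, whereas a predicate that also treats U+00A0 as a separator yields three tokens. Swapping the predicate is the only change needed, which is the flexibility the proposal is after.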
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984873#comment-14984873 ]

Dawid Weiss commented on LUCENE-6874:

Depends what you consider a trap. A non-breakable whitespace could be a legitimate way to prevent two tokens from being separated if they need to be tokenized together. An example that comes to my mind is the special "zero-width" space or the hyphenation marker... which even on its own poses a problem [1]...

Ultimately it is probably a question of whether we want to tokenize on "whitespace as in formatted text" or "whitespace as in logical codepoint units", and it doesn't apply only to WhitespaceTokenizer, but to any tokenizer in general?

bq. I think WhitespaceTokenizer should tokenize on this.

Seems like the majority of people would want it to be tokenized, I agree. But if you change this, then there is no way to go back to the previous behavior. Currently it's relatively easy to wrap your input in a reader that replaces those problematic codepoints on the fly before they're fed to the tokenizer.

[1] https://www.cs.tut.fi/~jkorpela/shy.html
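The workaround described at the end of this comment, wrapping the input in a reader that rewrites the problematic code points, might look like the following plain `java.io` sketch. `NbspToSpaceReader` and `normalize` are hypothetical names, and this is not Lucene's CharFilter API; it relies on the replacement being one char for one char, so character offsets stay unchanged.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical reader that maps the three non-breaking spaces to a plain space. */
final class NbspToSpaceReader extends FilterReader {
    NbspToSpaceReader(Reader in) {
        super(in);
    }

    private static char map(char c) {
        return (c == '\u00A0' || c == '\u2007' || c == '\u202F') ? ' ' : c;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c < 0 ? c : map((char) c);   // pass end-of-stream (-1) through untouched
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = off; i < off + n; i++) {   // loop body is skipped when n == -1
            buf[i] = map(buf[i]);
        }
        return n;
    }

    /** Convenience helper for trying the reader on an in-memory string. */
    static String normalize(String text) {
        try (Reader r = new NbspToSpaceReader(new StringReader(text))) {
            StringBuilder out = new StringBuilder();
            char[] buf = new char[256];
            for (int n; (n = r.read(buf, 0, buf.length)) != -1; ) {
                out.append(buf, 0, n);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Passing such a reader to an unchanged WhitespaceTokenizer would make it split on NBSP while leaving the tokenizer's default behavior available to everyone else, which is the trade-off the comment points at.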
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985540#comment-14985540 ]

Jack Krupansky commented on LUCENE-6874:

+1 for using the Unicode definition of white space rather than the (odd) Java definition. From a Solr user perspective, the fact that Java is used for implementation under the hood should be irrelevant. That said, the Javadoc for WhitespaceTokenizer#isTokenChar does explicitly refer to isWhitespace already.

The term "non-breaking white space" explicitly refers to line breaking and has no mention of tokens in either Unicode or traditional casual usage.

From a Solr user perspective, there is like zero value to having NBSP from HTML web pages being treated as if it were not traditional white space.

From a Solr user perspective, the primary use of whitespace tokenizer is to avoid the fact that standard tokenizer breaks on various special characters such as occur in product numbers.

In short, the benefits to Solr users for NBSP being tokenized as white space seem to outweigh any minor use cases for treating it as non-white space. A compatibility mode can be provided if those minor use cases are considered truly worthwhile.
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984940#comment-14984940 ]

Dawid Weiss commented on LUCENE-6874:

Any improvement to the docs that clarify what the software does would be great :)
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984896#comment-14984896 ]

Adrien Grand commented on LUCENE-6874:

So maybe we should solve this problem by adding some documentation?
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986193#comment-14986193 ]
Uwe Schindler commented on LUCENE-6874:
I opened LUCENE-6879 for the idea (which is related to this issue as it provides a simple and general way suitable for Java 8).
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985076#comment-14985076 ]
Uwe Schindler commented on LUCENE-6874:
My personal opinion on this:
- The thing is called WhitespaceTokenizer, so it should do what the name says (split on isWhitespace).
- If we want something else, maybe provide a separate CharTokenizer implementation that also splits on NBSP.

In general, WhitespaceTokenizer is not used for "classical" full text. For this type of text one would better use StandardTokenizer, ICU's tokenizers, or the language-specific ones for Chinese or Japanese. People using WhitespaceTokenizer tend to be those who have very special types of fields, like a list of whitespace-separated tokens used for faceting, or something like a list of product numbers. These types of tokens were always easy to handle with WhitespaceTokenizer. If you wanted to keep your facet tokens together, you were able to use NBSP! So a change here would be a break for those apps :-)

So I would just update the documentation to explain what this thing does (splitting on whitespace and not on spaces in general).
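To make the distinction discussed above concrete, here is a minimal plain-Java sketch of the two splitting behaviours. It does not use Lucene at all: the class name WhitespaceSplitSketch and the split helper are hypothetical and exist only for illustration. The only real APIs involved are Character.isWhitespace(int), which excludes the non-breaking spaces, and Character.isSpaceChar(int), which includes them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical illustration only; this is not Lucene's CharTokenizer.
public class WhitespaceSplitSketch {

    // Splits text into tokens, treating every code point accepted by
    // isSeparator as a delimiter. Empty tokens are dropped.
    static List<String> split(String text, IntPredicate isSeparator) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isSeparator.test(cp)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "red" and "wine" are joined by NBSP (U+00A0); "white" follows a plain space.
        String text = "red\u00A0wine white";

        // Splitting on Character.isWhitespace keeps the NBSP-joined pair together.
        System.out.println(split(text, cp -> Character.isWhitespace(cp)).size());

        // Also splitting on Character.isSpaceChar breaks on NBSP as well.
        System.out.println(split(text,
                cp -> Character.isWhitespace(cp) || Character.isSpaceChar(cp)).size());
    }
}
```

With the first predicate, an NBSP-joined value such as a multi-word facet label stays a single token, which is the behaviour existing applications may rely on. The second predicate is roughly what a separate "split on all Unicode spaces" tokenizer would do.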