[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-15 Thread Steve Rowe (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006170#comment-15006170 ] Steve Rowe commented on LUCENE-6874: bq. Thanks for the fruitful discussions! I hope Steve Rowe is not

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-14 Thread ASF subversion and git services (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005553#comment-15005553 ] ASF subversion and git services commented on LUCENE-6874: - Commit 1714354 from

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-14 Thread ASF subversion and git services (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005563#comment-15005563 ] ASF subversion and git services commented on LUCENE-6874: - Commit 1714355 from

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-13 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004794#comment-15004794 ] Uwe Schindler commented on LUCENE-6874: --- If nobody objects, I will commit this tomorrow. >

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-12 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002248#comment-15002248 ] Uwe Schindler commented on LUCENE-6874: --- Here is the output of the Reuters test: {noformat}

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-12 Thread David Smiley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002388#comment-15002388 ] David Smiley commented on LUCENE-6874: -- +1 Patch is good Uwe. > WhitespaceTokenizer should tokenize

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread David Smiley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001688#comment-15001688 ] David Smiley commented on LUCENE-6874: -- +1 I like it Uwe; nice job. Automating the generation of

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001338#comment-15001338 ] Uwe Schindler commented on LUCENE-6874: --- Result when running: {noformat} unicode-tokenizers:

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001345#comment-15001345 ] Uwe Schindler commented on LUCENE-6874: --- Sorry my fault, must be UCharacter.isUWhitespace(), result

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Steve Rowe (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001346#comment-15001346 ] Steve Rowe commented on LUCENE-6874: Uwe, you're using UCharacter.isWhitespace(), but that's the same

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001349#comment-15001349 ] Uwe Schindler commented on LUCENE-6874: --- Sorry updated my post, recognized this a minute ago. :-)

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Steve Rowe (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001000#comment-15001000 ] Steve Rowe commented on LUCENE-6874: bq. My idea was to use a Unicode data file and extract all

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000940#comment-15000940 ] Uwe Schindler commented on LUCENE-6874: --- bq. Why persist the bitset and deal with the build issues

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001257#comment-15001257 ] Uwe Schindler commented on LUCENE-6874: --- Cool! So my idea would be to write a small tool in the

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Steve Rowe (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001312#comment-15001312 ] Steve Rowe commented on LUCENE-6874: bq. My idea was to create the whitespace chars as int[] array

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Steve Rowe (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001283#comment-15001283 ] Steve Rowe commented on LUCENE-6874: bq. Would this work? Yes, but I think ICU4J is more

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001292#comment-15001292 ] Uwe Schindler commented on LUCENE-6874: --- My idea was to create the whitespace chars as int[] array

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-11 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001307#comment-15001307 ] Uwe Schindler commented on LUCENE-6874: --- ...hacking Groovy script using ICU4J as specified in

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-10 Thread David Smiley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998838#comment-14998838 ] David Smiley commented on LUCENE-6874: -- Uwe, Why persist the bitset and deal with the build issues

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-09 Thread David Smiley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997057#comment-14997057 ] David Smiley commented on LUCENE-6874: -- Sorry, I really disagree with you on this. I don't think

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-09 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998105#comment-14998105 ] Uwe Schindler commented on LUCENE-6874: --- I would be fine with removing WhitespaceTokenizer in Lucene

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-09 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997516#comment-14997516 ] Uwe Schindler commented on LUCENE-6874: --- Yeah remove it! LUCENE-6879 is enough to quickly define

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-09 Thread Adrien Grand (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997645#comment-14997645 ] Adrien Grand commented on LUCENE-6874: -- I tend to like Uwe's idea. I have often wondered what the

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-09 Thread David Smiley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1499#comment-1499 ] David Smiley commented on LUCENE-6874: -- Just for clarification, Adrien, are you suggesting that

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-07 Thread Robert Muir (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14995169#comment-14995169 ] Robert Muir commented on LUCENE-6874: - The shared factory is confusing: this is supposed to be a

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread Steve Rowe (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987423#comment-14987423 ] Steve Rowe commented on LUCENE-6874: bq. I just noticed that your patch contains hardcoded filenames

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread David Smiley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987455#comment-14987455 ] David Smiley commented on LUCENE-6874: -- Nice thorough job Steve! I propose that we consolidate the

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread Steve Rowe (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988290#comment-14988290 ] Steve Rowe commented on LUCENE-6874: bq. I propose that we consolidate the TokenizerFactories here

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread Steve Rowe (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988404#comment-14988404 ] Steve Rowe commented on LUCENE-6874: bq. My concern for Solr users is that NBSP occurs somewhat

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread Jack Krupansky (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988338#comment-14988338 ] Jack Krupansky commented on LUCENE-6874: Certainly Solr can update its example schemas to use

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread Jack Krupansky (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988646#comment-14988646 ] Jack Krupansky commented on LUCENE-6874: bq. Because WST and WDF should really only be used as a

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread Jack Krupansky (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988615#comment-14988615 ] Jack Krupansky commented on LUCENE-6874: Tika is the other (main?) approach to ingesting text

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread Yonik Seeley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988805#comment-14988805 ] Yonik Seeley commented on LUCENE-6874: -- bq. I'd implement it as a bitset for characters < 65k A

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread David Smiley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988791#comment-14988791 ] David Smiley commented on LUCENE-6874: -- Jack: My use-case since you asked: I've got a document

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-03 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987410#comment-14987410 ] Uwe Schindler commented on LUCENE-6874: --- Thanks Steve! I just noticed that your patch contains

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985602#comment-14985602 ] Uwe Schindler commented on LUCENE-6874: --- bq. unicode whitespace is probably more useful and already

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Robert Muir (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985587#comment-14985587 ] Robert Muir commented on LUCENE-6874: - I don't think we should make yet another definition of

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985593#comment-14985593 ] Uwe Schindler commented on LUCENE-6874: --- bq. In short, the benefits to Solr users for NBSP being

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread David Smiley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985598#comment-14985598 ] David Smiley commented on LUCENE-6874: -- bq. So maybe we should solve this problem by adding some

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985611#comment-14985611 ] Uwe Schindler commented on LUCENE-6874: --- [~rcmuir]: I am already preparing a patch :-) >

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Robert Muir (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985605#comment-14985605 ] Robert Muir commented on LUCENE-6874: - You can add a CharTokenizer to ICU analysis module that just

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread David Smiley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985704#comment-14985704 ] David Smiley commented on LUCENE-6874: -- Uwe, I beg to differ on WDF but I think we can put that

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Robert Muir (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985748#comment-14985748 ] Robert Muir commented on LUCENE-6874: - I don't think we need to deprecate WhitespaceTokenizer, I

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985749#comment-14985749 ] Uwe Schindler commented on LUCENE-6874: --- bq. Then we could consider deprecating WhitespaceTokenizer

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985799#comment-14985799 ] Uwe Schindler commented on LUCENE-6874: --- For those people that tend to think that StandardTokenizer

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Steve Rowe (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985776#comment-14985776 ] Steve Rowe commented on LUCENE-6874: A JFlex version would be fast and simple and not require ICU to

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985885#comment-14985885 ] Uwe Schindler commented on LUCENE-6874: --- One thing to make it fully flexible in Lucene Trunk (Java 8

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984873#comment-14984873 ] Dawid Weiss commented on LUCENE-6874: - Depends what you consider a trap. A non-breakable whitespace

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Jack Krupansky (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985540#comment-14985540 ] Jack Krupansky commented on LUCENE-6874: +1 for using the Unicode definition of white space

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984940#comment-14984940 ] Dawid Weiss commented on LUCENE-6874: - Any improvement to the docs that clarify what the software

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Adrien Grand (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984896#comment-14984896 ] Adrien Grand commented on LUCENE-6874: -- So maybe we should solve this problem by adding some

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986193#comment-14986193 ] Uwe Schindler commented on LUCENE-6874: --- I opened LUCENE-6879 for the idea (which is related to

[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

2015-11-02 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985076#comment-14985076 ] Uwe Schindler commented on LUCENE-6874: --- My personal opinion on this: - The thing is called