[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006170#comment-15006170
]
Steve Rowe commented on LUCENE-6874:
bq. Thanks for the fruitful discussions! I hope Steve Rowe is not
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005553#comment-15005553
]
ASF subversion and git services commented on LUCENE-6874:
-
Commit 1714354 from
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005563#comment-15005563
]
ASF subversion and git services commented on LUCENE-6874:
-
Commit 1714355 from
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004794#comment-15004794
]
Uwe Schindler commented on LUCENE-6874:
---
If nobody objects, I will commit this tomorrow.
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002248#comment-15002248
]
Uwe Schindler commented on LUCENE-6874:
---
Here is the output of the reuters test:
{noformat}
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002388#comment-15002388
]
David Smiley commented on LUCENE-6874:
--
+1 Patch is good Uwe.
> WhitespaceTokenizer should tokenize
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001688#comment-15001688
]
David Smiley commented on LUCENE-6874:
--
+1 I like it Uwe; nice job. Automating the generation of
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001338#comment-15001338
]
Uwe Schindler commented on LUCENE-6874:
---
Result when running:
{noformat}
unicode-tokenizers:
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001345#comment-15001345
]
Uwe Schindler commented on LUCENE-6874:
---
Sorry, my fault; it must be UCharacter.isUWhitespace(), result
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001346#comment-15001346
]
Steve Rowe commented on LUCENE-6874:
Uwe, you're using UCharacter.isWhitespace(), but that's the same
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001349#comment-15001349
]
Uwe Schindler commented on LUCENE-6874:
---
Sorry, I updated my post; I noticed this a minute ago. :-)
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001000#comment-15001000
]
Steve Rowe commented on LUCENE-6874:
bq. My idea was to use a Unicode data file and extract all
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000940#comment-15000940
]
Uwe Schindler commented on LUCENE-6874:
---
bq. Why persist the bitset and deal with the build issues
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001257#comment-15001257
]
Uwe Schindler commented on LUCENE-6874:
---
Cool!
So my idea would be to write a small tool in the
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001312#comment-15001312
]
Steve Rowe commented on LUCENE-6874:
bq. My idea was to create the whitespace chars as int[] array
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001283#comment-15001283
]
Steve Rowe commented on LUCENE-6874:
bq. Would this work?
Yes, but I think ICU4J is more
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001292#comment-15001292
]
Uwe Schindler commented on LUCENE-6874:
---
My idea was to create the whitespace chars as int[] array
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001307#comment-15001307
]
Uwe Schindler commented on LUCENE-6874:
---
...hacking Groovy script using ICU4J as specified in
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998838#comment-14998838
]
David Smiley commented on LUCENE-6874:
--
Uwe,
Why persist the bitset and deal with the build issues
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997057#comment-14997057
]
David Smiley commented on LUCENE-6874:
--
Sorry, I really disagree with you on this. I don't think
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998105#comment-14998105
]
Uwe Schindler commented on LUCENE-6874:
---
I would be fine with removing WhitespaceTokenizer in Lucene
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997516#comment-14997516
]
Uwe Schindler commented on LUCENE-6874:
---
Yeah remove it! LUCENE-6879 is enough to quickly define
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997645#comment-14997645
]
Adrien Grand commented on LUCENE-6874:
--
I tend to like Uwe's idea. I have often wondered what the
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1499#comment-1499
]
David Smiley commented on LUCENE-6874:
--
Just for clarification, Adrien, are you suggesting that
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14995169#comment-14995169
]
Robert Muir commented on LUCENE-6874:
-
The shared factory is confusing: this is supposed to be a
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987423#comment-14987423
]
Steve Rowe commented on LUCENE-6874:
bq. I just noticed that your patch contains hardcoded filenames
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987455#comment-14987455
]
David Smiley commented on LUCENE-6874:
--
Nice thorough job Steve!
I propose that we consolidate the
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988290#comment-14988290
]
Steve Rowe commented on LUCENE-6874:
bq. I propose that we consolidate the TokenizerFactories here
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988404#comment-14988404
]
Steve Rowe commented on LUCENE-6874:
bq. My concern for Solr users is that NBSP occurs somewhat
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988338#comment-14988338
]
Jack Krupansky commented on LUCENE-6874:
Certainly Solr can update its example schemas to use
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988646#comment-14988646
]
Jack Krupansky commented on LUCENE-6874:
bq. Because WST and WDF should really only be used as a
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988615#comment-14988615
]
Jack Krupansky commented on LUCENE-6874:
Tika is the other (main?) approach to ingesting text
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988805#comment-14988805
]
Yonik Seeley commented on LUCENE-6874:
--
bq. I'd implement it as a bitset for characters < 65k
A
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988791#comment-14988791
]
David Smiley commented on LUCENE-6874:
--
Jack: My use-case since you asked: I've got a document
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987410#comment-14987410
]
Uwe Schindler commented on LUCENE-6874:
---
Thanks Steve! I just noticed that your patch contains
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985602#comment-14985602
]
Uwe Schindler commented on LUCENE-6874:
---
bq. unicode whitespace is probably more useful and already
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985587#comment-14985587
]
Robert Muir commented on LUCENE-6874:
-
I don't think we should make yet another definition of
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985593#comment-14985593
]
Uwe Schindler commented on LUCENE-6874:
---
bq. In short, the benefits to Solr users for NBSP being
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985598#comment-14985598
]
David Smiley commented on LUCENE-6874:
--
bq. So maybe we should solve this problem by adding some
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985611#comment-14985611
]
Uwe Schindler commented on LUCENE-6874:
---
[~rcmuir]: I am already preparing a patch :-)
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985605#comment-14985605
]
Robert Muir commented on LUCENE-6874:
-
You can add a CharTokenizer to ICU analysis module that just
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985704#comment-14985704
]
David Smiley commented on LUCENE-6874:
--
Uwe,
I beg to differ on WDF but I think we can put that
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985748#comment-14985748
]
Robert Muir commented on LUCENE-6874:
-
I don't think we need to deprecate WhitespaceTokenizer, I
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985749#comment-14985749
]
Uwe Schindler commented on LUCENE-6874:
---
bq. Then we could consider deprecating WhitespaceTokenizer
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985799#comment-14985799
]
Uwe Schindler commented on LUCENE-6874:
---
For those people that tend to think that StandardTokenizer
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985776#comment-14985776
]
Steve Rowe commented on LUCENE-6874:
A JFlex version would be fast and simple and not require ICU to
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985885#comment-14985885
]
Uwe Schindler commented on LUCENE-6874:
---
One thing to make it fully flexible in Lucene Trunk (Java 8
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984873#comment-14984873
]
Dawid Weiss commented on LUCENE-6874:
-
Depends what you consider a trap.
A non-breakable whitespace
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985540#comment-14985540
]
Jack Krupansky commented on LUCENE-6874:
+1 for using the Unicode definition of white space
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984940#comment-14984940
]
Dawid Weiss commented on LUCENE-6874:
-
Any improvement to the docs that clarify what the software
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984896#comment-14984896
]
Adrien Grand commented on LUCENE-6874:
--
So maybe we should solve this problem by adding some
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986193#comment-14986193
]
Uwe Schindler commented on LUCENE-6874:
---
I opened LUCENE-6879 for the idea (which is related to
[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985076#comment-14985076
]
Uwe Schindler commented on LUCENE-6874:
---
My personal opinion on this:
- The thing is called