[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-06-02 Thread Amrit Sarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amrit Sarkar updated LUCENE-7705:
-
Attachment: LUCENE-7705.patch

Following the discussion on LUCENE-7857 and Erick's pointers, I introduced 
_autoGeneratePhraseQueries="false"_ on all the FieldType definitions for this 
test case. All the test scenarios pass on both *master* and *branch_6x*. Patch 
attached.
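
For anyone following along, here is a minimal Lucene-level sketch of what that 
setting controls. Solr's _autoGeneratePhraseQueries_ attribute corresponds to 
the classic QueryParser's setAutoGeneratePhraseQueries; the field name and 
analyzer below are purely illustrative:

{code}
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class AutoPhraseSketch {
  public static void main(String[] args) throws Exception {
    // Illustrative field/analyzer; in Solr this is configured on the FieldType.
    QueryParser parser = new QueryParser("letter", new WhitespaceAnalyzer());
    // When one query term analyzes into several tokens (as a maxTokenLen-limited
    // tokenizer can cause), "false" builds a boolean OR of the tokens rather
    // than a phrase query that would rarely match.
    parser.setAutoGeneratePhraseQueries(false);
    Query q = parser.parse("letter");
    System.out.println(q);
  }
}
{code}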

> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the 
> max token length
> -
>
> Key: LUCENE-7705
> URL: https://issues.apache.org/jira/browse/LUCENE-7705
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Amrit Sarkar
>Assignee: Erick Erickson
>Priority: Minor
> Fix For: master (7.0), 6.7
>
> Attachments: LUCENE-7705, LUCENE-7705.patch, LUCENE-7705.patch, 
> LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, 
> LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character 
> limit for the CharTokenizer? Changing this limit currently requires people to 
> copy/paste incrementToken into some new class, since incrementToken is final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but 
> doing so requires code rather than being configurable in the schema (see the 
> sketch just below this quote).
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes 
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) 
> (Factories) it would take adding a c'tor to the base class in Lucene and 
> using it in the factory.
> Any objections?
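
As referenced above, changing the KeywordTokenizer buffer size in code today 
looks roughly like this (a hedged sketch; the 1024 value is illustrative, and 
the point of this issue is to expose the knob in the schema instead):

{code}
import java.io.StringReader;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class KeywordBufferSketch {
  public static void main(String[] args) throws Exception {
    // Constructing the tokenizer directly is the only way to change the
    // default 256 buffer size -- there is no schema attribute for it today.
    try (KeywordTokenizer tok = new KeywordTokenizer(1024)) { // illustrative
      tok.setReader(new StringReader("one big keyword token"));
      CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
      tok.reset();
      while (tok.incrementToken()) {
        System.out.println(term.toString()); // the whole input as one token
      }
      tok.end();
    }
  }
}
{code}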






[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-28 Thread Erick Erickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson updated LUCENE-7705:
---
Attachment: LUCENE-7705.patch

Final patch; incorporates R. Muir's comments (thanks!). What bugs me is that I 
_know_ that pattern is invalid but didn't catch it... Siiigh.

Committing momentarily, will backport to 6.7.







[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-25 Thread Erick Erickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson updated LUCENE-7705:
---
Attachment: LUCENE-7705.patch

Final patch; added CHANGES.txt etc. Incorporates [~rcmuir]'s suggestions, 
thanks Robert!
1> Integer parameters are now int.
2> Put the tokenizers back in TestRandomChains (really, removed the special 
handling of these tokenizers).
3> Added tests for a zero length and for lengths > 
StandardTokenizer.MAX_TOKEN_LENGTH_LIMIT (1M, which is still very large); 
sketch below.
4> Added better comments in the javadocs.
5> Added a test for input longer than the I/O buffer size.

Unless there are objections I'll commit this tomorrow.
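
For reference, a minimal sketch of the bounds checks those tests exercise. It 
assumes the int-valued maxTokenLen constructor this patch adds to the 
CharTokenizer family; class and method names are illustrative:

{code}
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

import static org.apache.lucene.util.LuceneTestCase.expectThrows;

public class MaxTokenLenBoundsSketch {
  public void testBounds() {
    // A zero (or negative) max length should be rejected outright ...
    expectThrows(IllegalArgumentException.class,
        () -> new WhitespaceTokenizer(0));
    // ... as should anything above StandardTokenizer.MAX_TOKEN_LENGTH_LIMIT (1M).
    expectThrows(IllegalArgumentException.class,
        () -> new WhitespaceTokenizer(StandardTokenizer.MAX_TOKEN_LENGTH_LIMIT + 1));
  }
}
{code}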








[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-09 Thread Amrit Sarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amrit Sarkar updated LUCENE-7705:
-
Attachment: LUCENE-7705







[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-09 Thread Amrit Sarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amrit Sarkar updated LUCENE-7705:
-
Attachment: (was: LUCENE-7705)







[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-09 Thread Amrit Sarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amrit Sarkar updated LUCENE-7705:
-
Attachment: LUCENE-7705

Yes Erick, I saw the "ant precommit" errors; tabs instead of whitespace, got it.

I am still seeing this:
{code}
   [junit4] Tests with failures [seed: C3F5B66314F27B5E]:
   [junit4]   - 
org.apache.solr.util.TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers
{code}
{code}
   [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestMaxTokenLenTokenizer -Dtests.method=testSingleFieldSameAnalyzers 
-Dtests.seed=C3F5B66314F27B5E -Dtests.slow=true -Dtests.locale=fr-CA 
-Dtests.timezone=Asia/Qatar -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.10s | 
TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers <<<
   [junit4]> Throwable #1: java.lang.RuntimeException: Exception during 
query
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([C3F5B66314F27B5E:A927890C4C11AB91]:0)
   [junit4]>at 
org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:896)
   [junit4]>at 
org.apache.solr.util.TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers(TestMaxTokenLenTokenizer.java:104)
   [junit4]>at java.lang.Thread.run(Thread.java:745)
   [junit4]> Caused by: java.lang.RuntimeException: REQUEST FAILED: 
xpath=//result[@numFound=1]
   [junit4]>xml response was: 
   [junit4]> 
   [junit4]> 011
   [junit4]> 
   [junit4]>request was:q=letter0:lett=xml
   [junit4]>at 
org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:889)
   [junit4]>... 40 more
{code}

But if it is working for you, I am good.

You didn't include the newly created files in the latest patch again, so I have 
posted a new one with "precommit" sorted out and all the files included.







[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-09 Thread Erick Erickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson updated LUCENE-7705:
---
Attachment: LUCENE-7705.patch

Precommit passes; let's use this patch as the basis going forward. The "ant 
precommit" task will show you all the failures.

I'm not sure what's up with the test; 
"ant -Dtestcase=TestMaxTokenLenTokenizer test" 
works just fine for me. What errors do you see?







[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-09 Thread Amrit Sarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amrit Sarkar updated LUCENE-7705:
-
Attachment: LUCENE-7705.patch

Erick,

I have absolutely no idea how I uploaded a false/incomplete patch. I certainly 
understand the above; I had incorporated two different configurations to show 
the difference in the first place, and have refined them as per your latest 
comments.

There is one serious issue I have been facing for a day: all the tests pass in 
the IntelliJ IDE, but the same tests fail when I run _"ant 
-Dtestcase=TestMaxTokenLenTokenizer test"_. I don't know what to do about that.

The latest patch you uploaded is incomplete, as the newly created files are not 
part of it. I have worked on the previous one (dated: 27-07-17). You may need 
to reapply the pre-commit changes to the latest patch.

Sorry for the hiccup.







[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-05-08 Thread Erick Erickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson updated LUCENE-7705:
---
Attachment: LUCENE-7705.patch

Fixed a couple of precommit issues; otherwise the patch is the same.

Amrit:

This test always fails for me in TestMaxTokenLenTokenizer:

assertQ("Check the total number of docs", req("q", "letter:lett"),
"//result[@numFound=0]");

Looking at the code, numFound should be 1, I believe. The problem is that _both_ 
the index-time and query-time analysis cut the term down to 3 characters, so 
finding a document when searching for "lett" here is perfectly legitimate. In 
fact, any token, no matter how long and no matter what follows "let", will 
succeed. I think all the rest of the tests for fields in this set will fail for 
a similar reason when checking for search terms longer than the max token 
length. Do you agree?

If you agree, let's add a few tests explicitly showing this, so that future 
people looking at the code will know it's intended behavior. I.e., add lines 
like:

// Anything that matches the first three letters should be found when maxLen=3
assertQ("Check the total number of docs", req("q", "letter:letXyz"),
"//result[@numFound=1]");

Or I somehow messed up the patch.
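
To make the analysis-level behavior concrete, here is a hedged standalone 
sketch. It assumes the (AttributeFactory, int maxTokenLen) constructor this 
patch proposes for the CharTokenizer subclasses; exact token boundaries depend 
on the final CharTokenizer semantics:

{code}
import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeFactory;

public class MaxLenAnalysisSketch {
  public static void main(String[] args) throws Exception {
    // Assumed ctor: LetterTokenizer(AttributeFactory, int maxTokenLen).
    try (Tokenizer tok =
        new LetterTokenizer(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, 3)) {
      tok.setReader(new StringReader("letter lett letXyz"));
      CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
      tok.reset();
      while (tok.incrementToken()) {
        // Each run of letters is cut at 3 chars, so every word above yields a
        // leading token "let" -- which is why "letter:lett" matches the doc.
        System.out.println(term.toString());
      }
      tok.end();
    }
  }
}
{code}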







[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-02-26 Thread Amrit Sarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amrit Sarkar updated LUCENE-7705:
-
Attachment: LUCENE-7705.patch







[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-02-24 Thread Amrit Sarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amrit Sarkar updated LUCENE-7705:
-
Attachment: LUCENE-7705.patch







[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-02-24 Thread Erick Erickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson updated LUCENE-7705:
---
Attachment: LUCENE-7705.patch

Oops, forgot to "git add" the new test file.







[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-02-23 Thread Erick Erickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson updated LUCENE-7705:
---
Attachment: LUCENE-7705.patch

Patch that fixes up a few comments and regularizes maxChars* to maxToken* and 
the like. I enhanced a test to exercise tokens longer than 256 characters.

There was a problem with LowerCaseTokenizerFactory: its getMultiTermComponent 
method constructed a LowerCaseFilterFactory with the _original_ arguments, 
including maxTokenLen, which then threw an error. There's a nocommit in there 
for the nonce; what's the right thing to do here?

[~amrit sarkar] Do you have any ideas for a more elegant solution? The nocommit 
is there because this feels just too hacky, but it does prove that this is the 
problem. One possible shape for a fix is sketched below.
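
A hedged sketch of one possible cleaner fix: copy the original schema args and 
drop the tokenizer-only parameter before handing them to the filter factory. 
getOriginalArgs() is the standard AbstractAnalysisFactory accessor; the 
subclass wrapper and the arg name here are illustrative:

{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.core.LowerCaseTokenizerFactory;
import org.apache.lucene.analysis.util.AbstractAnalysisFactory;

public class PatchedLowerCaseTokenizerFactory extends LowerCaseTokenizerFactory {
  public PatchedLowerCaseTokenizerFactory(Map<String, String> args) {
    super(args);
  }

  @Override
  public AbstractAnalysisFactory getMultiTermComponent() {
    // Copy the original args, then remove the tokenizer-only parameter so
    // LowerCaseFilterFactory's argument validation doesn't reject it.
    Map<String, String> map = new HashMap<>(getOriginalArgs());
    map.remove("maxTokenLen");
    return new LowerCaseFilterFactory(map);
  }
}
{code}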

It seems like we should close SOLR-10186 and just make the code changes here. 
With this patch I successfully tested adding fields with tokens both longer and 
shorter than 256 characters, so I don't think there's anything beyond this 
patch to do in Solr. I suppose we could add some maxTokenLen bits to some of 
the schemas just to exercise that (which would have caught the 
LowerCaseTokenizerFactory bit).







[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

2017-02-22 Thread Amrit Sarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amrit Sarkar updated LUCENE-7705:
-
Description: 
SOLR-10186

[~erickerickson]: Is there a good reason that we hard-code a 256 character 
limit for the CharTokenizer? Changing this limit currently requires people to 
copy/paste incrementToken into some new class, since incrementToken is final.
KeywordTokenizer can easily change the default (which is also 256 bytes), but 
doing so requires code rather than being configurable in the schema.
For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes 
(WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) 
(Factories) it would take adding a c'tor to the base class in Lucene and using 
it in the factory; a sketch of the factory side follows.
Any objections?
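
A hedged sketch of that factory-side proposal. The parameter name, the 255 
default, and the proposed WhitespaceTokenizer(AttributeFactory, int) 
constructor are assumptions for illustration, while getInt is the standard 
AbstractAnalysisFactory helper:

{code}
import java.util.Map;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;

public class MaxLenWhitespaceTokenizerFactory extends TokenizerFactory {
  private final int maxTokenLen;

  public MaxLenWhitespaceTokenizerFactory(Map<String, String> args) {
    super(args);
    // Read the (assumed) schema attribute, e.g. <tokenizer ... maxTokenLen="10"/>.
    maxTokenLen = getInt(args, "maxTokenLen", 255);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public WhitespaceTokenizer create(AttributeFactory factory) {
    // Uses the proposed maxTokenLen ctor on the CharTokenizer subclass.
    return new WhitespaceTokenizer(factory, maxTokenLen);
  }
}
{code}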




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org