[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amrit Sarkar updated LUCENE-7705:
---------------------------------
    Attachment: LUCENE-7705.patch

Following the discussion on LUCENE-7857 and Erick's pointers, I introduced _autoGeneratePhraseQueries="false"_ on all the FieldType definitions for this test case. All test scenarios execute successfully on both *master* and *branch_6x*. Patch attached.

> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the
> max token length
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-7705
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7705
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Amrit Sarkar
>            Assignee: Erick Erickson
>            Priority: Minor
>             Fix For: master (7.0), 6.7
>
>         Attachments: LUCENE-7705, LUCENE-7705.patch, LUCENE-7705.patch,
> LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch,
> LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256 character
> limit for the CharTokenizer? Changing this limit currently requires people
> to copy/paste incrementToken into some new class, since incrementToken is
> final.
> KeywordTokenizer can easily change the default (which is also 256 bytes), but
> doing so requires code rather than being configurable in the schema.
> For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes
> (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer)
> (Factories) it would take adding a c'tor to the base class in Lucene and
> using it in the factory.
> Any objections?

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
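For reference, a FieldType definition of the kind described above might look like the following sketch. The field type name is made up; the maxTokenLen attribute is the one this issue adds to the tokenizer factories:

```xml
<!-- Illustrative Solr schema snippet; the type name is hypothetical. -->
<fieldType name="text_ws_maxlen" class="solr.TextField"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- maxTokenLen caps the token length instead of the hard-coded 256 -->
    <tokenizer class="solr.WhitespaceTokenizerFactory" maxTokenLen="3"/>
  </analyzer>
</fieldType>
```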
[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated LUCENE-7705:
-----------------------------------
    Attachment: LUCENE-7705.patch

Final patch; incorporates R. Muir's comments (thanks!). What bugs me is that I _know_ that pattern is invalid but didn't catch it. Sigh. Committing momentarily; will backport to 6.7.
[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated LUCENE-7705:
-----------------------------------
    Attachment: LUCENE-7705.patch

Final patch; added CHANGES.txt etc. Incorporates [~rcmuir]'s suggestions, thanks Robert!:
1> Integer parameters are now int.
2> Put the tokenizers back in TestRandomChains (really, took out the special handling of these tokenizers).
3> Added tests for 0 length and for values greater than StandardTokenizer.MAX_TOKEN_LENGTH_LIMIT (1M, which is still very large).
4> Added better comments in the javadocs.
5> Added a test for input larger than the I/O buffer size.

Unless there are objections, I'll commit this tomorrow.
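Point 3> above can be sketched as a plain bounds check. This is an illustration in stdlib Java only; the class and method names are made up and this is not the actual Lucene constructor code:

```java
// Illustrative sketch of the bounds check described in point 3> above;
// names are hypothetical, not the actual Lucene code.
public class MaxTokenLenCheck {
    // StandardTokenizer.MAX_TOKEN_LENGTH_LIMIT is 1024 * 1024 (1M).
    static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024;

    // Rejects 0 (and negatives) as well as values above the limit.
    static int checkMaxTokenLen(int maxTokenLen) {
        if (maxTokenLen <= 0 || maxTokenLen > MAX_TOKEN_LENGTH_LIMIT) {
            throw new IllegalArgumentException(
                "maxTokenLen must be greater than 0 and at most "
                + MAX_TOKEN_LENGTH_LIMIT + ", got " + maxTokenLen);
        }
        return maxTokenLen;
    }
}
```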
[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amrit Sarkar updated LUCENE-7705:
---------------------------------
    Attachment: LUCENE-7705
[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amrit Sarkar updated LUCENE-7705:
---------------------------------
    Attachment: (was: LUCENE-7705)
[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amrit Sarkar updated LUCENE-7705:
---------------------------------
    Attachment: LUCENE-7705

Yes Erick, I saw the "ant precommit" errors (tabs instead of whitespace), got it. I am still seeing this:

{code}
[junit4] Tests with failures [seed: C3F5B66314F27B5E]:
[junit4]   - org.apache.solr.util.TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers
{code}

{code}
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestMaxTokenLenTokenizer -Dtests.method=testSingleFieldSameAnalyzers -Dtests.seed=C3F5B66314F27B5E -Dtests.slow=true -Dtests.locale=fr-CA -Dtests.timezone=Asia/Qatar -Dtests.asserts=true -Dtests.file.encoding=UTF-8
[junit4] ERROR 0.10s | TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers <<<
[junit4]> Throwable #1: java.lang.RuntimeException: Exception during query
[junit4]>     at __randomizedtesting.SeedInfo.seed([C3F5B66314F27B5E:A927890C4C11AB91]:0)
[junit4]>     at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:896)
[junit4]>     at org.apache.solr.util.TestMaxTokenLenTokenizer.testSingleFieldSameAnalyzers(TestMaxTokenLenTokenizer.java:104)
[junit4]>     at java.lang.Thread.run(Thread.java:745)
[junit4]> Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//result[@numFound=1]
[junit4]>     xml response was:
[junit4]>
[junit4]>     011
[junit4]>
[junit4]>     request was: q=letter0:lett=xml
[junit4]>     at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:889)
[junit4]>     ... 40 more
{code}

But if it is working for you, I am good. You didn't include the newly created files in the latest patch again; I have posted a new one with "precommit" sorted out and all the files included.
[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated LUCENE-7705:
-----------------------------------
    Attachment: LUCENE-7705.patch

Precommit passes; let's use this patch as a basis going forward. The "ant precommit" task will show you all the failures.

I'm not sure what's up with the test; "ant -Dtestcase=TestMaxTokenLenTokenizer test" works just fine for me. What errors do you see?
[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amrit Sarkar updated LUCENE-7705:
---------------------------------
    Attachment: LUCENE-7705.patch

Erick, I have absolutely no idea how I uploaded a false/incomplete patch. I certainly understand the above; I incorporated two different configurations to show the difference in the first place, and refined them as per your latest comments.

There is one serious issue I have been facing for a day: all the tests pass in the IntelliJ IDE, but the same tests fail when I run _"ant -Dtestcase=TestMaxTokenLenTokenizer test"_. I don't know what to do about that.

The latest patch you uploaded is incomplete, as the newly created files are not part of it. I have worked on the previous one (dated: 27-07-17). You may need to redo the pre-commit changes on the latest patch. Sorry for the hiccup.
[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated LUCENE-7705:
-----------------------------------
    Attachment: LUCENE-7705.patch

Fixed a couple of precommit issues; otherwise the patch is the same.

Amrit: this test in TestMaxTokenLenTokenizer always fails for me:

{code}
assertQ("Check the total number of docs", req("q", "letter:lett"), "//result[@numFound=0]");
{code}

Looking at the code, I believe numFound should be 1. The problem is that _both_ the index-time and query-time analysis trim the term to 3 characters, so finding a document when searching for "lett" here is perfectly legitimate. In fact, any token, no matter how long and no matter what follows "let", will succeed. I think all the rest of the tests for fields in this set will fail for a similar reason when checking for search terms longer than the max token length. Do you agree?

If you agree, let's add a few tests explicitly showing this, so that people looking at the code in the future will know it's intended behavior. I.e. add lines like:

{code}
// Anything that matches the first three letters should be found when maxLen=3
assertQ("Check the total number of docs", req("q", "letter:letXyz"), "//result[@numFound=1]");
{code}

Or I somehow messed up the patch.
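The collision described above can be sketched in a few lines of stdlib Java. This is purely an illustration, not Lucene code: when both the index-time and query-time analyzers cap tokens at maxTokenLen characters, any two terms sharing the same first maxTokenLen characters become the same token.

```java
// Illustrative sketch (not Lucene code) of the truncation collision:
// if both index-time and query-time analysis truncate tokens to
// maxTokenLen characters, any query term that shares its first
// maxTokenLen characters with an indexed term matches it.
public class TruncationCollision {
    static String truncate(String token, int maxTokenLen) {
        return token.length() > maxTokenLen ? token.substring(0, maxTokenLen) : token;
    }

    public static void main(String[] args) {
        int maxTokenLen = 3;
        String indexedTerm = truncate("letter", maxTokenLen); // indexed as "let"
        String queryTerm   = truncate("letXyz", maxTokenLen); // queried as "let"
        // Same truncated token, so the search legitimately finds the document.
        System.out.println(indexedTerm.equals(queryTerm));
    }
}
```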
[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amrit Sarkar updated LUCENE-7705:
---------------------------------
    Attachment: LUCENE-7705.patch
[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amrit Sarkar updated LUCENE-7705:
---------------------------------
    Attachment: LUCENE-7705.patch
[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated LUCENE-7705:
-----------------------------------
    Attachment: LUCENE-7705.patch

Oops, forgot to "git add" the new test file.
[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated LUCENE-7705:
-----------------------------------
    Attachment: LUCENE-7705.patch

Patch that fixes up a few comments, regularizes maxChars* to maxToken* and the like. I enhanced a test to cover tokens longer than 256 characters.

There was a problem with LowerCaseTokenizerFactory: the getMultiTermComponent method constructed a LowerCaseFilterFactory with the _original_ arguments, including maxTokenLen, which then threw an error. There's a nocommit in there for the nonce; what's the right thing to do here? [~amrit sarkar] Do you have any ideas for a more elegant solution? The nocommit is there because this feels just too hacky, but it does prove that this is the problem.

It seems like we should close SOLR-10186 and just make the code changes here. With this patch I successfully tested adding fields with tokens both longer and shorter than 256 characters, so I don't think there's anything beyond this patch to do with Solr. I suppose we could add some maxTokenLen bits to some of the schemas just to exercise that (which would have found the LowerCaseTokenizerFactory bit).
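One hypothetical direction for the getMultiTermComponent problem described above, purely a sketch and not the actual fix: copy the factory's arguments and drop the key the delegate factory does not accept before constructing it. Names here are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of filtering out an argument (e.g. maxTokenLen)
// that a delegate factory does not accept; not the actual Lucene fix.
public class FactoryArgs {
    static Map<String, String> withoutKey(Map<String, String> originalArgs, String key) {
        // Copy so the original argument map is left untouched.
        Map<String, String> filtered = new HashMap<>(originalArgs);
        filtered.remove(key);
        return filtered;
    }
}
```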
[jira] [Updated] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
[ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amrit Sarkar updated LUCENE-7705:
---------------------------------
    Description:

SOLR-10186
[~erickerickson]: Is there a good reason that we hard-code a 256 character limit for the CharTokenizer? Changing this limit currently requires people to copy/paste incrementToken into some new class, since incrementToken is final.

KeywordTokenizer can easily change the default (which is also 256 bytes), but doing so requires code rather than being configurable in the schema. For KeywordTokenizer, this is Solr-only. For the CharTokenizer classes (WhitespaceTokenizer, UnicodeWhitespaceTokenizer and LetterTokenizer) (Factories) it would take adding a c'tor to the base class in Lucene and using it in the factory.

Any objections?
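The behavior under discussion can be illustrated with a minimal stdlib sketch: a whitespace tokenizer that takes its maximum token length as a constructor parameter instead of a fixed 256. This is an illustration only, not CharTokenizer itself; over-long tokens are emitted in maxTokenLen-sized chunks, which I believe mirrors how CharTokenizer flushes a full buffer rather than discarding the remainder.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal illustrative sketch (not CharTokenizer itself): a whitespace
// tokenizer whose maximum token length is configurable rather than a
// hard-coded 256. Tokens longer than maxTokenLen are emitted in chunks.
public class ConfigurableWhitespaceTokenizer {
    private final int maxTokenLen;

    public ConfigurableWhitespaceTokenizer(int maxTokenLen) {
        if (maxTokenLen <= 0) {
            throw new IllegalArgumentException("maxTokenLen must be > 0");
        }
        this.maxTokenLen = maxTokenLen;
    }

    public List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            // Split words longer than maxTokenLen into successive chunks.
            for (int i = 0; i < word.length(); i += maxTokenLen) {
                tokens.add(word.substring(i, Math.min(word.length(), i + maxTokenLen)));
            }
        }
        return tokens;
    }
}
```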