[
https://issues.apache.org/jira/browse/MAILBOX-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tellier Benoit closed MAILBOX-301.
----------------------------------
> Lucene terms length exceeded on some emails
> -------------------------------------------
>
> Key: MAILBOX-301
> URL: https://issues.apache.org/jira/browse/MAILBOX-301
> Project: James Mailbox
> Issue Type: Bug
> Components: elasticsearch
> Affects Versions: master
> Reporter: Tellier Benoit
> Fix For: master
>
>
> Lucene supports a maximum term size of 32KB.
> This term size can be exceeded, causing indexing to fail.
> Thus, the team had positioned "ignore_above" filters to filter out too-long
> terms, and set their value to the Lucene maximum.
> However, as stated in
> https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html
> :
> {noformat}
> Note:
> The value for ignore_above is the character count, but Lucene counts bytes.
> If you use UTF-8 text with many non-ASCII characters, you may want to set the
> limit to 32766 / 3 = 10922 since UTF-8 characters may occupy at most 3 bytes.
> {noformat}
> Thus the ES limit is computed on string length (characters), not on byte
> length as in Lucene.
> We can craft a UTF-8 char sequence that exceeds the Lucene limit without
> triggering the ES limit.
> A much lower value (like 4KB) seems more reasonable, as long terms may not be
> significant.
> Note:
>  - Implement tests:
>    - Demonstrating this bug
>    - Demonstrating only too long terms are ignored
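
The character-versus-byte mismatch described in the issue can be illustrated with a minimal, standalone Java sketch. It does not use Elasticsearch or Lucene. The class and method names (TermLengthDemo, utf8Length, passesCharLimit, euroSigns) are hypothetical, and the character-count check only approximates "ignore_above" under the assumption that it compares against String.length(). The constant 32766 is the byte limit quoted in the Elasticsearch note above.

```java
import java.nio.charset.StandardCharsets;

public class TermLengthDemo {
    // Byte limit for a single term, as quoted in the Elasticsearch note (32766).
    static final int LUCENE_MAX_TERM_BYTES = 32766;

    // Number of bytes the term occupies once encoded as UTF-8.
    static int utf8Length(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    // Assumption: approximates an ignore_above style check on character count.
    // Returns true when the term would be kept (not ignored).
    static boolean passesCharLimit(String term, int ignoreAbove) {
        return term.length() <= ignoreAbove;
    }

    // Builds a string of `count` euro signs (U+20AC): one Java char each,
    // three bytes each in UTF-8.
    static String euroSigns(int count) {
        StringBuilder sb = new StringBuilder(count);
        for (int i = 0; i < count; i++) {
            sb.append('\u20AC');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String term = euroSigns(20000);
        System.out.println("chars: " + term.length());
        System.out.println("bytes: " + utf8Length(term));
        System.out.println("kept with limit 32766: " + passesCharLimit(term, 32766));
        System.out.println("kept with limit 4096: " + passesCharLimit(term, 4096));
        System.out.println("exceeds byte limit: "
                + (utf8Length(term) > LUCENE_MAX_TERM_BYTES));
    }
}
```

A 20000-character term of three-byte characters passes a 32766-character check while encoding to 60000 bytes, which is the failure mode reported here. With a 4096-character limit, a kept term cannot exceed 4096 * 3 = 12288 bytes, since each Java char contributes at most three UTF-8 bytes, so it stays well under the byte limit.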
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)