[ 
https://issues.apache.org/jira/browse/SOLR-8495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-8495:
-------------------------------
    Attachment: SOLR-8495.patch

Here is the initial patch for this issue. It is based on idea #1 from 
[~steve_rowe].

This patch introduces a new {{ParseLongStringFieldUpdateProcessorFactory}}, which 
performs the following check:
{code}
if (valSize > 32000) {
  return new LongStringField(stringVal);
}
{code}
So we can add a new type mapping to {{AddSchemaFieldsUpdateProcessorFactory}}:
{code}
<lst name="typeMapping">
  <str name="valueClass">org.apache.solr.update.processor.LongStringField</str>
  <str name="fieldType">lstring</str>
</lst>
{code}
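
One detail worth checking in the size test: the exception quoted in this issue reports the limit in UTF-8 bytes (32766), while the snippet above does not say what {{valSize}} measures. If it counts characters, a string made of multi-byte characters could pass the check and still fail at index time. The following is only a sketch of a byte-based check; {{LongStringCheck}} and {{exceedsTermLimit}} are hypothetical names and are not part of the attached patch.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not taken from SOLR-8495.patch. */
final class LongStringCheck {
    /** Limit in UTF-8 bytes, as reported by the exception quoted in this issue. */
    static final int MAX_TERM_BYTES = 32766;

    private LongStringCheck() {}

    /** Returns true when the UTF-8 encoding of value is longer than maxBytes. */
    static boolean exceedsTermLimit(String value, int maxBytes) {
        // Measure the encoded size, not value.length(), because one char
        // may take up to three bytes in UTF-8.
        return value.getBytes(StandardCharsets.UTF_8).length > maxBytes;
    }
}
```

For example, a value of 20000 two-byte characters is below a 32000-character threshold but encodes to 40000 bytes, so a character count would let it through.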

There are some problems with this approach:
- We must define the chunk size (into which large strings are split) inside 
the schema file (for {{ChunkTokenizerFactory}}), not inside solrconfig.
- In the multi-valued case, what should we do when the first value is > 32kb and 
the second value is < 32kb? With this patch, the first value is mapped to 
LongStringField and the second value is still a String, so 
{{AddSchemaFieldsUpdateProcessor#mapValueClassesToFieldType}} will create a 
field based on {{defaultFieldType}} (should we modify the method?).
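
For the multi-valued question, one possible rule (an assumption for discussion, not what the patch currently does) is to promote the whole field to the long-string type as soon as any single value exceeds the limit. The sketch below works on plain strings rather than on the patch's {{LongStringField}}; {{MultiValueTypeSketch}} and {{chooseFieldType}} are hypothetical names, "lstring" is taken from the type mapping above, and "string" is an assumed default type name.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical illustration of an "any value is long" rule; not from the patch. */
final class MultiValueTypeSketch {
    /** Limit in UTF-8 bytes, as reported by the exception quoted in this issue. */
    static final int MAX_TERM_BYTES = 32766;

    private MultiValueTypeSketch() {}

    /**
     * Picks one field type for all values of a multi-valued field:
     * longType if at least one value is too large, otherwise defaultType.
     */
    static String chooseFieldType(List<String> values, String longType, String defaultType) {
        for (String v : values) {
            if (v.getBytes(StandardCharsets.UTF_8).length > MAX_TERM_BYTES) {
                return longType; // a single oversized value decides for the whole field
            }
        }
        return defaultType;
    }
}
```

With this rule, a document whose first value is > 32kb and whose second value is < 32kb would get "lstring" for the field instead of falling back to {{defaultFieldType}}.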

> Schemaless mode cannot index large text fields
> ----------------------------------------------
>
>                 Key: SOLR-8495
>                 URL: https://issues.apache.org/jira/browse/SOLR-8495
>             Project: Solr
>          Issue Type: Bug
>          Components: Data-driven Schema, Schema and Analysis
>    Affects Versions: 4.10.4, 5.3.1, 5.4
>            Reporter: Shalin Shekhar Mangar
>              Labels: difficulty-easy, impact-medium
>             Fix For: 5.5, 6.0
>
>         Attachments: SOLR-8495.patch
>
>
> The schemaless mode by default indexes all string fields into an indexed 
> StrField which is limited to 32KB text. Anything larger than that leads to an 
> exception during analysis.
> {code}
> Caused by: java.lang.IllegalArgumentException: Document contains at least one 
> immense term in field="text" (whose UTF8 encoding is longer than the max 
> length 32766)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
