[jira] [Commented] (SOLR-8495) Schemaless mode cannot index large text fields

2016-10-06 Thread Jan Høydahl (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15551472#comment-15551472
 ] 

Jan Høydahl commented on SOLR-8495:
---

See relevant comment in SOLR-9526

> Schemaless mode cannot index large text fields
> --
>
> Key: SOLR-8495
> URL: https://issues.apache.org/jira/browse/SOLR-8495
> Project: Solr
>  Issue Type: Bug
>  Components: Data-driven Schema, Schema and Analysis
>Affects Versions: 4.10.4, 5.3.1, 5.4
>Reporter: Shalin Shekhar Mangar
>  Labels: difficulty-easy, impact-medium
> Fix For: 5.5, 6.0
>
> Attachments: SOLR-8495.patch
>
>
> The schemaless mode by default indexes all string fields into an indexed 
> StrField which is limited to 32KB text. Anything larger than that leads to an 
> exception during analysis.
> {code}
> Caused by: java.lang.IllegalArgumentException: Document contains at least one 
> immense term in field="text" (whose UTF8 encoding is longer than the max 
> length 32766)
> {code}






[jira] [Commented] (SOLR-8495) Schemaless mode cannot index large text fields

2016-09-30 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536180#comment-15536180
 ] 

Cao Manh Dat commented on SOLR-8495:


OK, so we will wait for SOLR-9526 to get committed before continuing work on 
this issue.







[jira] [Commented] (SOLR-8495) Schemaless mode cannot index large text fields

2016-09-21 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511574#comment-15511574
 ] 

Steve Rowe commented on SOLR-8495:
--

I looked at [~caomanhdat]'s patch, and I think there's way more machinery there 
than we need to address the problem.  A couple things I noticed:

* ChunkTokenizer splits values at a maximum token length (rather than 
truncating), but I can't think of a good use for that behavior.
* ParseLongStringFieldUpdateProcessorFactory extends 
NumericFieldUpdateProcessorFactory, which doesn't make sense, since there's no 
parsing going on, and LongStringField isn't numeric. 
* ParseLongStringFieldUpdateProcessor.mutateValue() uses 
String.getBytes(Charset.defaultCharset()) to determine a value's length, but 
Lucene encodes terms as UTF-8, so UTF-8 should be used when testing value 
lengths. 
* I don't think we need new tokenizers or processors or field types here.

I agree with [~hossman] that his SOLR-9526 approach is the way to go (including 
his TruncateFieldUpdateProcessorFactory idea mentioned above, to address the 
problem described on this issue - his suggested "10000" limit neatly avoids 
worrying about encoded-length issues, since each char can take up at most 3 
UTF-8 encoded bytes, and 3*10000 is less than the 32,766-byte 
IndexWriter.MAX_TERM_LENGTH).

{quote}
bq. Autodetect space-separated text above a (customizable? maybe 256 bytes or 
so by default?) threshold as tokenized text rather than as StrField.
I'm leery of an approach like this, because it would be extremely trappy 
depending on the order docs were indexed
{quote}

I agree; Hoss's SOLR-9526 approach will index everything as text_general but 
then add "string" fieldtype copies for values that aren't "too long".







[jira] [Commented] (SOLR-8495) Schemaless mode cannot index large text fields

2016-09-20 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507076#comment-15507076
 ] 

Hoss Man commented on SOLR-8495:


the root issue here is the same as SOLR-9526: assuming (untokenized) StrField.

I think my suggestion in that issue makes the most sense -- but it doesn't 
address the surface error noted in this issue: an exception when "string" 
values are too big.

So perhaps for that we should just add TruncateFieldUpdateProcessorFactory to 
the data_driven configs with some reasonable upper limit?

{code}
<processor class="solr.TruncateFieldUpdateProcessorFactory">
  <str name="typeClass">solr.StrField</str>
  <int name="maxLength">10000</int>
</processor>
{code}

bq. Autodetect space-separated text above a (customizable? maybe 256 bytes or 
so by default?) threshold as tokenized text rather than as StrField.

I'm leery of an approach like this, because it would be extremely trappy 
depending on the order docs were indexed: similar to the float/int problems we 
have now, but probably more so, and with more confusion, because it wouldn't 
necessarily be obvious at first glance when/why StrField was chosen vs 
TextField (or even that a different choice was made if the user didn't go look, 
since unlike the int/float issue the _output_ of the stored field would be the 
same "String").

(And you'd only ever get an error if the first doc was a "short" string and 
some other doc was above the 32K Lucene limit ... if all the docs were under 
the 32K limit but above the str/text threshold, you'd never get an error, 
regardless of the order the docs were indexed in. But one doc ordering would 
give you searchable text fields, and another doc order would give you StrFields 
that didn't match any search you tried.)









[jira] [Commented] (SOLR-8495) Schemaless mode cannot index large text fields

2016-01-06 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085893#comment-15085893
 ] 

Shalin Shekhar Mangar commented on SOLR-8495:
-

+1 for #1







[jira] [Commented] (SOLR-8495) Schemaless mode cannot index large text fields

2016-01-06 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085882#comment-15085882
 ] 

Steve Rowe commented on SOLR-8495:
--

Here are the ways I can think of to address this problem:

# Autodetect space-separated text above a (customizable? maybe 256 bytes or so 
by default?) threshold as tokenized text rather than as StrField.
# Make StrField auto-truncate at Lucene's 32k limit.
# Make the guessed "strings" fieldType a TextField that uses KeywordTokenizer, 
and add a token filter that truncates terms to Lucene's 32k limit (a rough 
sketch of this is shown below).

I like #1 the best, because I think it aligns with likely user expectations, 
and it doesn't silently throw away data.
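
For reference, option #3 could look roughly like the fieldType sketched below. 
The solr.TruncateTokenFilterFactory wiring and the prefixLength value are 
assumptions about how the truncation might be configured, not part of any 
existing patch; 10,922 chars at a maximum of 3 UTF-8 bytes per char stays 
within the 32,766-byte term limit:

{code}
<!-- hypothetical "big string" type: the whole value becomes a single token,
     truncated so its UTF-8 encoding cannot exceed Lucene's term length limit -->
<fieldType name="string_truncated" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TruncateTokenFilterFactory" prefixLength="10922"/>
  </analyzer>
</fieldType>
{code}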




