[jira] [Commented] (LUCENE-8947) Indexing fails with "too many tokens for field" when using custom term frequencies
[ https://issues.apache.org/jira/browse/LUCENE-8947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907330#comment-16907330 ]

Michael McCandless commented on LUCENE-8947:
--------------------------------------------

Indeed we disable norms ... that's a good idea to skip length accumulation when norms are disabled. I'll give that a shot.

> Indexing fails with "too many tokens for field" when using custom term frequencies
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-8947
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8947
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 7.5
>            Reporter: Michael McCandless
>            Priority: Major
>
> We are using custom term frequencies (LUCENE-7854) to index per-token scoring
> signals, however for one document that had many tokens and those tokens had
> fairly large (~998,000) scoring signals, we hit this exception:
> {noformat}
> 2019-08-05T21:32:37,048 [ERROR] (LuceneIndexing-3-thread-3) com.amazon.lucene.index.IndexGCRDocument: Failed to index doc:
> java.lang.IllegalArgumentException: too many tokens for field "foobar"
>     at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:825)
>     at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
>     at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
>     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
>     at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
>     at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
>     at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)
> {noformat}
> This is happening in this code in {{DefaultIndexingChain.java}}:
> {noformat}
> try {
>   invertState.length = Math.addExact(invertState.length, invertState.termFreqAttribute.getTermFrequency());
> } catch (ArithmeticException ae) {
>   throw new IllegalArgumentException("too many tokens for field \"" + field.name() + "\"");
> }
> {noformat}
> This is where Lucene accumulates the total length (number of tokens) for the
> field. But total length doesn't really make sense if you are using custom
> term frequencies to hold arbitrary scoring signals? Or, maybe it does make
> sense, if the user is using this as simple boosting, but maybe we should allow
> this length to be a {{long}}?

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
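[Editor's note] A minimal standalone sketch (not Lucene's actual code) of why the exception fires: the field-length accumulator in the quoted snippet is an {{int}}, so with a per-token custom term frequency around 998,000, {{Math.addExact}} overflows after only a couple of thousand tokens. The class and variable names below are illustrative only.

```java
// Sketch: how large custom term frequencies overflow an int length accumulator.
public class FieldLengthOverflow {
    public static void main(String[] args) {
        int length = 0;                    // stand-in for invertState.length (an int)
        final int customTermFreq = 998_000; // large per-token scoring signal from the issue
        int tokens = 0;
        try {
            while (true) {
                // Math.addExact throws ArithmeticException on int overflow,
                // which Lucene reports as "too many tokens for field".
                length = Math.addExact(length, customTermFreq);
                tokens++;
            }
        } catch (ArithmeticException ae) {
            // Integer.MAX_VALUE / 998_000 ≈ 2151, so only ~2151 tokens fit.
            System.out.println("overflowed after " + tokens + " tokens");
        }
    }
}
// prints: overflowed after 2151 tokens
```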
[jira] [Commented] (LUCENE-8947) Indexing fails with "too many tokens for field" when using custom term frequencies
[ https://issues.apache.org/jira/browse/LUCENE-8947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907086#comment-16907086 ]

Adrien Grand commented on LUCENE-8947:
--------------------------------------

Changing it to a long might be challenging for norms, since the current encoding relies on the fact that the length is an integer. Are you using norms? I guess not. Maybe we could skip computing the field length when norms are disabled?

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
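[Editor's note] A hedged, standalone sketch of the workaround Adrien suggests: only accumulate the field length when norms are enabled, since the length exists to feed the norm encoding. The {{omitNorms}} flag and {{accumulate}} helper below are stand-ins for illustration, not Lucene's actual indexing-chain code.

```java
// Sketch: skip the (overflow-prone) length accumulation when norms are off.
public class SkipLengthWhenNormsOff {
    static int accumulate(int length, int termFreq, boolean omitNorms) {
        if (omitNorms) {
            // Length only feeds norms; with norms disabled there is
            // nothing to accumulate, so overflow cannot occur.
            return length;
        }
        // With norms enabled, keep the existing overflow check.
        return Math.addExact(length, termFreq); // may throw ArithmeticException
    }

    public static void main(String[] args) {
        int nearMax = Integer.MAX_VALUE - 10;
        // Norms disabled: a huge custom term frequency no longer overflows.
        System.out.println(accumulate(nearMax, 998_000, true));
        // Norms enabled: the original "too many tokens" failure mode remains.
        try {
            accumulate(nearMax, 998_000, false);
        } catch (ArithmeticException ae) {
            System.out.println("overflow when norms enabled");
        }
    }
}
```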