[jira] [Commented] (LUCENE-8947) Indexing fails with "too many tokens for field" when using custom term frequencies
[ https://issues.apache.org/jira/browse/LUCENE-8947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907330#comment-16907330 ]

Michael McCandless commented on LUCENE-8947:
--------------------------------------------

Indeed we disable norms ... that's a good idea to skip length accumulation when norms are disabled. I'll give that a shot.

> Indexing fails with "too many tokens for field" when using custom term frequencies
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-8947
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8947
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 7.5
>            Reporter: Michael McCandless
>            Priority: Major
>
> We are using custom term frequencies (LUCENE-7854) to index per-token scoring
> signals, however for one document that had many tokens and those tokens had
> fairly large (~998,000) scoring signals, we hit this exception:
> {noformat}
> 2019-08-05T21:32:37,048 [ERROR] (LuceneIndexing-3-thread-3) com.amazon.lucene.index.IndexGCRDocument: Failed to index doc:
> java.lang.IllegalArgumentException: too many tokens for field "foobar"
>     at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:825)
>     at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
>     at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
>     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
>     at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
>     at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
>     at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)
> {noformat}
> This is happening in this code in {{DefaultIndexingChain.java}}:
> {noformat}
> try {
>   invertState.length = Math.addExact(invertState.length, invertState.termFreqAttribute.getTermFrequency());
> } catch (ArithmeticException ae) {
>   throw new IllegalArgumentException("too many tokens for field \"" + field.name() + "\"");
> }
> {noformat}
> This is where Lucene accumulates the total length (number of tokens) for the
> field. But total length doesn't really make sense if you are using custom
> term frequencies to hold arbitrary scoring signals? Or, maybe it does make
> sense, if the user is using this as simple boosting, but maybe we should allow
> this length to be a {{long}}?

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
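[Editor's note] A minimal standalone sketch (not Lucene's actual code) of why the exception fires: the field-length accumulator in the quoted snippet is an {{int}}, so with a per-token custom term frequency around 998,000, {{Math.addExact}} overflows after only a couple of thousand tokens. The class and variable names below are illustrative only.

```java
// Sketch: how large custom term frequencies overflow an int length accumulator.
public class FieldLengthOverflow {
    public static void main(String[] args) {
        int length = 0;                    // stand-in for invertState.length (an int)
        final int customTermFreq = 998_000; // large per-token scoring signal from the issue
        int tokens = 0;
        try {
            while (true) {
                // Math.addExact throws ArithmeticException on int overflow,
                // which Lucene reports as "too many tokens for field".
                length = Math.addExact(length, customTermFreq);
                tokens++;
            }
        } catch (ArithmeticException ae) {
            // Integer.MAX_VALUE / 998_000 ≈ 2151, so only ~2151 tokens fit.
            System.out.println("overflowed after " + tokens + " tokens");
        }
    }
}
// prints: overflowed after 2151 tokens
```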
[jira] [Commented] (LUCENE-8947) Indexing fails with "too many tokens for field" when using custom term frequencies
[ https://issues.apache.org/jira/browse/LUCENE-8947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907086#comment-16907086 ]

Adrien Grand commented on LUCENE-8947:
--------------------------------------

Changing it to a long might be challenging for norms, since the current encoding relies on the fact that the length is an integer. Are you using norms? I guess not. Maybe we could skip computing the field length when norms are disabled?

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
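[Editor's note] A hedged, standalone sketch of the workaround Adrien suggests: only accumulate the field length when norms are enabled, since the length exists to feed the norm encoding. The {{omitNorms}} flag and {{accumulate}} helper below are stand-ins for illustration, not Lucene's actual indexing-chain code.

```java
// Sketch: skip the (overflow-prone) length accumulation when norms are off.
public class SkipLengthWhenNormsOff {
    static int accumulate(int length, int termFreq, boolean omitNorms) {
        if (omitNorms) {
            // Length only feeds norms; with norms disabled there is
            // nothing to accumulate, so overflow cannot occur.
            return length;
        }
        // With norms enabled, keep the existing overflow check.
        return Math.addExact(length, termFreq); // may throw ArithmeticException
    }

    public static void main(String[] args) {
        int nearMax = Integer.MAX_VALUE - 10;
        // Norms disabled: a huge custom term frequency no longer overflows.
        System.out.println(accumulate(nearMax, 998_000, true));
        // Norms enabled: the original "too many tokens" failure mode remains.
        try {
            accumulate(nearMax, 998_000, false);
        } catch (ArithmeticException ae) {
            System.out.println("overflow when norms enabled");
        }
    }
}
```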