InvalidTokenOffsetsException in conjunction with highlighting and ICU folding and edgeNgrams
Hi there,

when highlighting a field with this definition:

    <fieldType name="name" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ICUTransformFilterFactory" id="Any-Latin"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ICUTransformFilterFactory" id="Any-Latin"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
      </analyzer>
    </fieldType>

containing this string:

    Mosfellsbær

I get the following exception if that field is among the highlight fields:

    SEVERE: org.apache.solr.common.SolrException: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token mosfellsbaer exceeds length of provided text sized 11
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:497)
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401)
        at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:131)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:636)
    Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token mosfellsbaer exceeds length of provided text sized 11
        at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:233)
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:490)

I tried with Solr 3.4 and 3.5 and got the same error with both. Removing the char filter didn't fix the problem either.

It seems like something odd happens when the string is folded. It can be seen in the analysis view, too: http://i.imgur.com/6B2Uh.png

The end offset remains 11 even after folding and transforming æ to ae, which seems wrong to me.

I also stumbled upon https://issues.apache.org/jira/browse/LUCENE-1500, which seems like a similar issue.

Is there a workaround for that problem, or is the field configuration wrong?
Re: InvalidTokenOffsetsException in conjunction with highlighting and ICU folding and edgeNgrams
On Mon, Dec 12, 2011 at 5:18 AM, Max nas...@gmail.com wrote:

> The end offset remains 11 even after folding and transforming æ to ae, which seems wrong to me.

End offsets refer to the *original text*, so this is correct. What is wrong is the EdgeNGram filter. See how it turns that 11 into a 12?

> I also stumbled upon https://issues.apache.org/jira/browse/LUCENE-1500 which seems like a similar issue. Is there a workaround for that problem or is the field configuration wrong?

For now, don't use EdgeNGrams.

-- lucidimagination.com
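To make the offset arithmetic in this reply concrete, here is a minimal Python sketch. It is a simulation, not Lucene code: `Token`, `fold`, `naive_edge_ngrams` and `offsets_past_text` are hypothetical names, the single-character replacement is only a stand-in for ICU folding, and the assumption that the n-gram step derives each end offset from the gram length is inferred from the 11 turning into a 12 in the analysis view.

```python
from dataclasses import dataclass


@dataclass
class Token:
    term: str
    start: int  # start offset into the ORIGINAL text
    end: int    # end offset into the ORIGINAL text (exclusive)


def fold(token: Token) -> Token:
    # Stand-in for ICU folding: the term may grow ("æ" -> "ae"),
    # but the offsets keep pointing at the original text.
    folded = token.term.lower().replace("æ", "ae")
    return Token(folded, token.start, token.end)


def naive_edge_ngrams(token: Token, min_gram: int = 2, max_gram: int = 15):
    # ASSUMED buggy behaviour: the end offset is computed from the gram
    # length, as if the term still had the length of the original text.
    longest = min(max_gram, len(token.term))
    return [
        Token(token.term[:n], token.start, token.start + n)
        for n in range(min_gram, longest + 1)
    ]


def offsets_past_text(tokens, text: str):
    # Mimics the kind of check behind "exceeds length of provided text".
    return [t for t in tokens if t.end > len(text)]


text = "Mosfellsbær"                      # 11 characters
folded = fold(Token(text, 0, len(text)))  # 12-character term, end offset still 11
grams = naive_edge_ngrams(folded)
bad = offsets_past_text(grams, text)

for t in bad:
    print(f"token {t.term!r} ends at {t.end}, text length is {len(text)}")
```

Under these assumptions only the longest gram, the full 12-character folded term, ends up with an end offset beyond the 11-character text, which matches the token named in the exception message.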
Re: InvalidTokenOffsetsException in conjunction with highlighting and ICU folding and edgeNgrams
On Mon, Dec 12, 2011 at 5:18 AM, Max nas...@gmail.com wrote:

> It seems like there is some weird stuff going on when folding the string, it can be seen in the analysis view, too: http://i.imgur.com/6B2Uh.png

I created a bug here: https://issues.apache.org/jira/browse/LUCENE-3642

Thanks for the screenshot, it makes it easy to do a test case here.

-- lucidimagination.com
Re: InvalidTokenOffsetsException in conjunction with highlighting and ICU folding and edgeNgrams
Robert, thank you for creating the issue in JIRA.

However, I need n-grams on that field. Is there an alternative to the EdgeNGramFilterFactory?

Thanks!

On Mon, Dec 12, 2011 at 1:25 PM, Robert Muir rcm...@gmail.com wrote:

> On Mon, Dec 12, 2011 at 5:18 AM, Max nas...@gmail.com wrote:
>
>> It seems like there is some weird stuff going on when folding the string, it can be seen in the analysis view, too: http://i.imgur.com/6B2Uh.png
>
> I created a bug here: https://issues.apache.org/jira/browse/LUCENE-3642
>
> Thanks for the screenshot, it makes it easy to do a test case here.
>
> -- lucidimagination.com
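For anyone wondering how front n-grams and valid highlight offsets could coexist, here is a minimal Python sketch of one possible approach. It is an assumption for illustration only, not a Solr configuration and not a statement of how LUCENE-3642 is or will be resolved; `edge_ngrams_keep_offsets` is a hypothetical helper. The idea: when the term length no longer matches the offset span (because an earlier filter changed the term), every gram reuses the token's original offsets instead of computing new ones.

```python
def edge_ngrams_keep_offsets(term, start, end, min_gram=2, max_gram=15):
    """Front n-grams as (gram, start_offset, end_offset) tuples.

    Hypothetical helper: per-gram offsets are only computed when the
    term still has the same length as the span it came from.
    """
    # If an earlier step changed the term (e.g. "æ" -> "ae"), gram
    # lengths no longer map onto positions in the original text.
    length_matches = (start + len(term) == end)
    grams = []
    for n in range(min_gram, min(max_gram, len(term)) + 1):
        gram_end = start + n if length_matches else end
        grams.append((term[:n], start, gram_end))
    return grams


# Folded term (12 chars) from "Mosfellsbær" (offsets 0..11):
changed = edge_ngrams_keep_offsets("mosfellsbaer", 0, 11)

# Unchanged term: length matches the span, so grams get precise offsets.
unchanged = edge_ngrams_keep_offsets("reykjavik", 0, 9)

print(changed[-1])
print(unchanged[0])
```

The trade-off in this sketch is that, for terms whose length changed, a highlighter working from these offsets would mark the whole original word rather than just the matched prefix, but no offset would point past the end of the text.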