Actually, looking at the Lucene source and the trace:

java.lang.StringIndexOutOfBoundsException: String index out of range: 2822
    at java.lang.String.substring(String.java:1765)
    at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:274)
    at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:313)
    at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:84)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    ...

I see now that getBestTextFragments() takes in a token stream, and each token in this stream already has start/end positions set. So the patch at LUCENE-1500 would mitigate the exception, but it looks like the real bug is in Solr.

-Peter

On Tue, Feb 24, 2009 at 4:28 PM, Peter Wolanin <peter.wola...@acquia.com> wrote:
> So, something in the highlighting code is counting bytes when it
> should be counting characters. Looks like a Lucene bug, so I'm
> surprised others have not hit this before. Probably this is it:
> https://issues.apache.org/jira/browse/LUCENE-1500
>
> -Peter
>
> On Tue, Feb 24, 2009 at 2:22 PM, Peter Wolanin <peter.wola...@acquia.com> wrote:
>> Here you can see a manifestation of it when trying to highlight with ?q=daß
>>
>> <lst name="highlighting">
>>   <lst name="ebdcc46ab3791a12dccd0f915a463bd2/node/11622">
>>     <arr name="body">
>>       <str>
>> -Community" einfach nicht mehr wahrnimmt.
>> Hätte mir am letzten Montag Nachmittag jemand gesagt, <strong>daß
>> </strong>ich am Abend
>>       </str>
>>       <str>
>> recht, wenn er sagte, d<strong>aß d</strong>as wirklich wertvolle an
>> Drupal schlichtweg seine (Entwickler- und Anwender-)
>>       </str>
>>       <str>
>> die Entstehungsgeschichte des Portals) auch dokumentiert worden, denn
>> Ihr vermutet schon richtig, da<strong>ß da</strong>
>>       </str>
>>     </arr>
>>   </lst>
>> </lst>
>>
>> You can see that the "strong" tags each get offset by one more character
>> from where they are supposed to be.
>>
>> -Peter
>>
>> On Mon, Feb 23, 2009 at 8:24 AM, Peter Wolanin <peter.wola...@acquia.com> wrote:
>>> We are using Solr trunk (1.4), currently
>>> "nightly exported - yonik - 2009-02-05 08:06:00"
>>>
>>> -Peter
>>>
>>> On Mon, Feb 23, 2009 at 8:07 AM, Koji Sekiguchi <k...@r.email.ne.jp> wrote:
>>>> Jacob,
>>>>
>>>> What Solr version are you using? There is a bug in SolrHighlighter of
>>>> Solr 1.3; you may want to look at:
>>>>
>>>> https://issues.apache.org/jira/browse/SOLR-925
>>>> https://issues.apache.org/jira/browse/LUCENE-1500
>>>>
>>>> regards,
>>>>
>>>> Koji
>>>>
>>>> Jacob Singh wrote:
>>>>> Hi,
>>>>>
>>>>> We ran into a weird one today. We have a document which is written in
>>>>> German, and every time we make a query which matches it, we get the
>>>>> following:
>>>>>
>>>>> java.lang.StringIndexOutOfBoundsException: String index out of range: 2822
>>>>>     at java.lang.String.substring(String.java:1935)
>>>>>     at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:274)
>>>>>
>>>>> From source diving, it looks like Lucene's highlighter is trying to
>>>>> call substring with an offset that is outside the bounds of the body
>>>>> field it is highlighting against. Running an fq against the ID of the
>>>>> document returns it fine (because no highlighting is done). I also took
>>>>> the body and tried to cut the first 2822 chars, and while that is near
>>>>> the end of the body, it is still in range.
>>>>>
>>>>> Here is the related code:
>>>>>
>>>>> startOffset = tokenGroup.matchStartOffset;
>>>>> endOffset = tokenGroup.matchEndOffset;
>>>>> tokenText = text.substring(startOffset, endOffset);
>>>>>
>>>>> This leads me to believe there is some problem with multibyte string
>>>>> encoding and Lucene's counting.
>>>>>
>>>>> Any ideas here? Tomcat is configured with UTF-8, by the way.
>>>>>
>>>>> Best,
>>>>> Jacob
>>>
>>> --
>>> Peter M. Wolanin, Ph.D.
>>> Momentum Specialist, Acquia. Inc.
>>> peter.wola...@acquia.com
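
For anyone following along, here is a small standalone Java sketch of the two things discussed in this thread: how the UTF-8 byte count of German text drifts away from its character count, and how an end offset past the end of the text makes substring() throw. The class OffsetDemo and the helper safeSubstring() are hypothetical names made up for illustration. The clamping is only an assumption about what a defensive guard could look like; it is not the actual LUCENE-1500 patch and does nothing about wrong offsets coming out of the token stream.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene or Solr.
final class OffsetDemo {

    /** Number of bytes the text occupies when encoded as UTF-8. */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /**
     * Hypothetical guard: clamps both offsets into [0, text.length()]
     * before calling substring(), so an out-of-range token offset
     * yields a shorter (possibly empty) string instead of an exception.
     */
    static String safeSubstring(String text, int startOffset, int endOffset) {
        int end = Math.max(0, Math.min(endOffset, text.length()));
        int start = Math.max(0, Math.min(startOffset, end));
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "jemand gesagt, daß ich";

        // 'ß' is one char in a Java String but two bytes in UTF-8, so the
        // two counts differ by one for every such character in the text.
        System.out.println("chars = " + text.length());
        System.out.println("UTF-8 bytes = " + utf8Length(text));

        // An end offset computed in bytes can point past the end of the text.
        int start = text.indexOf("ich");
        int byteBasedEnd = utf8Length(text);
        try {
            text.substring(start, byteBasedEnd);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unguarded substring failed: " + e.getMessage());
        }

        // The guarded version returns whatever is still in range.
        System.out.println("guarded: " + safeSubstring(text, start, byteBasedEnd));
    }
}
```

Clamping only hides the symptom: the highlight tags can still land in the wrong place, as in the ?q=daß output quoted above, so the offsets themselves still need to be fixed where they are produced.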