Re: Error with highlighter and UTF-8 chars?

Peter Wolanin Tue, 24 Feb 2009 14:07:46 -0800

Actually, looking at the Lucene source and the trace:

java.lang.StringIndexOutOfBoundsException: String index out of range: 2822
        at java.lang.String.substring(String.java:1765)
        at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:274)
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:313)
        at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:84)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        ...


I see now that getBestTextFragments() takes in a token stream, and
each token in this stream already has its start/end offsets set.  So the
patch at LUCENE-1500 would mitigate the exception, but it looks like the
real bug is in Solr.
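
To make the failure mode concrete, here is a minimal sketch of what
happens when a token's end offset does not match the text being
highlighted. The class OffsetGuardSketch and the helper safeSubstring()
are hypothetical names for illustration only; they are not Lucene or
Solr API and not the actual LUCENE-1500 patch. The byte-versus-character
mismatch is just one assumed way the offsets could get out of step, and
nothing in this thread confirms it is the real cause.

```java
import java.nio.charset.StandardCharsets;

public class OffsetGuardSketch {

    // Hypothetical helper, not part of Lucene or Solr: returns the slice only
    // when both offsets fall inside the text, otherwise null.
    static String safeSubstring(String text, int startOffset, int endOffset) {
        if (startOffset < 0 || startOffset > endOffset || endOffset > text.length()) {
            return null;
        }
        return text.substring(startOffset, endOffset);
    }

    public static void main(String[] args) {
        // "gesagt, daß" with the sharp s written as a Unicode escape.
        String text = "gesagt, da\u00DF";

        // Length in UTF-16 chars, which is what String.substring() expects.
        int charEnd = text.length();
        // Length in UTF-8 bytes; the sharp s takes two bytes, so this is larger.
        int byteEnd = text.getBytes(StandardCharsets.UTF_8).length;
        int start = text.indexOf("da\u00DF");

        // A character-based end offset is in range and yields the token text.
        System.out.println(safeSubstring(text, start, charEnd));
        // A byte-based end offset overshoots; the guard returns null instead of throwing.
        System.out.println(safeSubstring(text, start, byteEnd));

        // Without the guard, the same offsets raise the exception from the trace.
        try {
            text.substring(start, byteEnd);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unguarded substring failed: " + e.getMessage());
        }
    }
}
```

A guard like this only hides the symptom: the highlight is skipped
rather than fixed, which is why the offsets handed to the highlighter
still need to be corrected at the source.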

-Peter

On Tue, Feb 24, 2009 at 4:28 PM, Peter Wolanin <peter.wola...@acquia.com> wrote:
> So, something in the highlighting code is counting bytes when it
> should be counting characters.  It looks like a Lucene bug, so I'm
> surprised others have not hit this before.  This is probably it:
> https://issues.apache.org/jira/browse/LUCENE-1500
>
> -Peter
>
>
> On Tue, Feb 24, 2009 at 2:22 PM, Peter Wolanin <peter.wola...@acquia.com> wrote:
>> Here you can see a manifestation of it when trying to highlight with ?q=daß
>>
>> <lst name="highlighting">
>> <lst name="ebdcc46ab3791a12dccd0f915a463bd2/node/11622">
>> <arr name="body">
>> <str>
>> -Community" einfach nicht mehr wahrnimmt.
>> Hätte mir am letzten Montag Nachmittag jemand gesagt, <strong>daß
>> </strong>ich am Abend
>> </str>
>> <str>
>> recht, wenn er sagte, d<strong>aß d</strong>as wirklich wertvolle an
>> Drupal schlichtweg seine (Entwickler- und Anwender-)
>> </str>
>> <str>
>> die Entstehungsgeschichte des Portals) auch dokumentiert worden, denn
>> Ihr vermutet schon richtig, da<strong>ß da</strong>
>> </str>
>> </arr>
>> </lst>
>> </lst>
>>
>>
>> You can see that with each successive match the "strong" tags are
>> shifted one character further to the right of where they are supposed
>> to be.
>>
>>
>> -Peter
>>
>>
>>
>> On Mon, Feb 23, 2009 at 8:24 AM, Peter Wolanin <peter.wola...@acquia.com> wrote:
>>> We are using Solr trunk (1.4)  - currently " nightly exported - yonik
>>> - 2009-02-05 08:06:00"
>>>
>>> -Peter
>>>
>>> On Mon, Feb 23, 2009 at 8:07 AM, Koji Sekiguchi <k...@r.email.ne.jp> wrote:
>>>> Jacob,
>>>>
>>>> What Solr version are you using? There is a bug in SolrHighlighter of
>>>> Solr 1.3; you may want to look at:
>>>>
>>>> https://issues.apache.org/jira/browse/SOLR-925
>>>> https://issues.apache.org/jira/browse/LUCENE-1500
>>>>
>>>> regards,
>>>>
>>>> Koji
>>>>
>>>>
>>>> Jacob Singh wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We ran into a weird one today.  We have a document which is written in
>>>>> German, and every time we make a query which matches it, we get the
>>>>> following:
>>>>>
>>>>> java.lang.StringIndexOutOfBoundsException: String index out of range: 2822
>>>>>        at java.lang.String.substring(String.java:1935)
>>>>>        at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:274)
>>>>>
>>>>>
>>>>> From source diving, it looks like Lucene's highlighter is calling
>>>>> substring() with an offset that is outside the bounds of the body field
>>>>> which it is highlighting against.  Running an fq against the ID of the
>>>>> document returns it fine (because no highlighting is done), and I took
>>>>> the body and tried to cut the first 2822 chars; while that is near
>>>>> the end of the body, it is still in range.
>>>>>
>>>>> Here is the related code:
>>>>>
>>>>> startOffset = tokenGroup.matchStartOffset;
>>>>> endOffset = tokenGroup.matchEndOffset;
>>>>> tokenText = text.substring(startOffset, endOffset);
>>>>>
>>>>>
>>>>> This leads me to believe there is some problem with multibyte string
>>>>> encoding and how Lucene counts offsets.
>>>>>
>>>>> Any ideas here?  Tomcat is configured with UTF-8 btw.
>>>>>
>>>>> Best,
>>>>> Jacob
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Peter M. Wolanin, Ph.D.
>>> Momentum Specialist,  Acquia. Inc.
>>> peter.wola...@acquia.com
>>>



-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com
