InvalidTokenOffsetsException in conjunction with highlighting and ICU folding and edgeNgrams

2011-12-12 Thread Max
Hi there,

when highlighting a field with this definition:

<fieldType name="name" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Any-Latin"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.EdgeNGramFilterFactory"
            minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Any-Latin"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.EdgeNGramFilterFactory"
            minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
</fieldType>

containing this string:

Mosfellsbær

I get the following exception if that field is among the highlight fields:

SEVERE: org.apache.solr.common.SolrException: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token mosfellsbaer exceeds length of provided text sized 11
    at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:497)
    at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401)
    at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:131)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
    at java.lang.Thread.run(Thread.java:636)
Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token mosfellsbaer exceeds length of provided text sized 11
    at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:233)
    at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:490)
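
(Editorial sketch, not part of the original message.) The wording of the exception suggests a bounds check that compares each token's offsets with the length of the stored field text. The Python below is only an illustration of that kind of check, under that assumption; `InvalidTokenOffsetsError` and `check_token_offsets` are hypothetical names, not Lucene or Solr APIs.

```python
class InvalidTokenOffsetsError(Exception):
    """Hypothetical Python stand-in for Lucene's InvalidTokenOffsetsException."""


def check_token_offsets(term: str, start_offset: int, end_offset: int, text: str) -> None:
    # Assumption: offsets index into the stored field text, so neither
    # offset may point past the end of that text.
    if end_offset > len(text) or start_offset > len(text):
        raise InvalidTokenOffsetsError(
            f"Token {term} exceeds length of provided text sized {len(text)}"
        )


text = "Mosfellsbær"  # 11 characters as stored

# An end offset of 11 stays within the stored text, so this call passes.
check_token_offsets("mosfellsbaer", 0, 11, text)

# An end offset of 12 points past the stored text and triggers the error.
try:
    check_token_offsets("mosfellsbaer", 0, 12, text)
    message = None
except InvalidTokenOffsetsError as exc:
    message = str(exc)
```

Under this reading, a folded term that is longer than the original text is harmless by itself; the error only appears once a token carries an end offset larger than the stored text length.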

I tried with Solr 3.4 and 3.5, same error for both. Removing the char
filter didn't fix the problem either.

It seems like there is some weird stuff going on when folding the
string, it can be seen in the analysis view, too:

http://i.imgur.com/6B2Uh.png

The end offset remains 11 even after folding and transforming æ to
ae, which seems wrong to me.

I also stumbled upon https://issues.apache.org/jira/browse/LUCENE-1500
which seems like a similar issue.

Is there a workaround for that problem or is the field configuration wrong?


Re: InvalidTokenOffsetsException in conjunction with highlighting and ICU folding and edgeNgrams

2011-12-12 Thread Robert Muir
On Mon, Dec 12, 2011 at 5:18 AM, Max nas...@gmail.com wrote:

> The end offset remains 11 even after folding and transforming æ to
> ae, which seems wrong to me.

End offsets refer to the *original text* so this is correct.

What is wrong is the EdgeNGram filter. See how it turns that 11 into a 12?
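
(Editorial sketch, not part of the original message.) The arithmetic behind the 11 and the 12 can be reproduced in a few lines of Python. This is a toy model, not the Lucene implementation: `fold` only lowercases and expands "æ", and `edge_ngrams_buggy` assumes the filter derives each gram's end offset from the gram's length in the folded term rather than from the original text. Both function names are hypothetical.

```python
def fold(text: str) -> str:
    # Toy stand-in for the ICU folding/transform step: lowercase the text
    # and expand "æ" to "ae". The real filters do far more than this.
    return text.lower().replace("æ", "ae")


def edge_ngrams_buggy(term: str, start_offset: int, min_gram: int, max_gram: int):
    # Assumed faulty behaviour: end offset = start offset + gram length,
    # measured on the folded term instead of the original text.
    for size in range(min_gram, min(max_gram, len(term)) + 1):
        yield term[:size], start_offset, start_offset + size


original = "Mosfellsbær"  # 11 characters in the stored text
folded = fold(original)   # 12 characters after folding

grams = list(edge_ngrams_buggy(folded, 0, min_gram=2, max_gram=15))

# Grams whose end offset points past the end of the original text.
bad = [(gram, start, end) for gram, start, end in grams if end > len(original)]
```

In this model only the longest gram, the full 12-character folded term, receives an end offset of 12, which is one past the 11-character original and matches the token named in the exception.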


> I also stumbled upon https://issues.apache.org/jira/browse/LUCENE-1500
> which seems like a similar issue.
>
> Is there a workaround for that problem or is the field configuration wrong?

For now, don't use EdgeNGrams.

-- 
lucidimagination.com


Re: InvalidTokenOffsetsException in conjunction with highlighting and ICU folding and edgeNgrams

2011-12-12 Thread Robert Muir
On Mon, Dec 12, 2011 at 5:18 AM, Max nas...@gmail.com wrote:

> It seems like there is some weird stuff going on when folding the
> string, it can be seen in the analysis view, too:
>
> http://i.imgur.com/6B2Uh.png


I created a bug here, https://issues.apache.org/jira/browse/LUCENE-3642

Thanks for the screenshot, makes it easy to do a test case here.

-- 
lucidimagination.com


Re: InvalidTokenOffsetsException in conjunction with highlighting and ICU folding and edgeNgrams

2011-12-12 Thread Max
Robert, thank you for creating the issue in JIRA.

However, I need n-grams on that field. Is there an alternative to
EdgeNGramFilterFactory?

Thanks!

On Mon, Dec 12, 2011 at 1:25 PM, Robert Muir rcm...@gmail.com wrote:
> On Mon, Dec 12, 2011 at 5:18 AM, Max nas...@gmail.com wrote:
>
>> It seems like there is some weird stuff going on when folding the
>> string, it can be seen in the analysis view, too:
>>
>> http://i.imgur.com/6B2Uh.png
>
> I created a bug here, https://issues.apache.org/jira/browse/LUCENE-3642
>
> Thanks for the screenshot, makes it easy to do a test case here.
>
> --
> lucidimagination.com