[jira] Commented: (SOLR-1394) HTML stripper is splitting tokens

Jason Rutherglen (JIRA) Tue, 22 Sep 2009 17:55:40 -0700

    [ https://issues.apache.org/jira/browse/SOLR-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758519#action_12758519 ]


Jason Rutherglen commented on SOLR-1394:
----------------------------------------

Anders, 

We're seeing the error again. We're going to try this patch and
HTMLStripReader, and we'll see what happens. Here's the latest
stack trace, which is pretty much the same as the last one:

{code}
SEVERE: java.io.IOException: Mark invalid
        at java.io.BufferedReader.reset(BufferedReader.java:485)
        at org.apache.lucene.analysis.CharReader.reset(CharReader.java:63)
        at org.apache.solr.analysis.HTMLStripCharFilter.restoreState(HTMLStripCharFilter.java:170)
        at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:727)
        at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:741)
        at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:451)
        at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:637)
        at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:174)
        at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:50)
        at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:38)
        at org.apache.solr.analysis.SnowballPorterFilter.incrementToken(SnowballPorterFilterFactory.java:116)
        at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:401)
        at org.apache.solr.analysis.BufferedTokenStream.read(BufferedTokenStream.java:94)
        at org.apache.solr.analysis.BufferedTokenStream.next(BufferedTokenStream.java:80)
        at org.apache.lucene.analysis.TokenStream.incrementToken(TokenStream.java:316)
        at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:224)
        at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:138)
        at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:244)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:772)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:755)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2611)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2583)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:241)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
        at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
{code}
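
For anyone reading the trace: the top frames show restoreState() calling
reset() on a BufferedReader whose mark is no longer valid. The sketch below
only reproduces that JDK mechanism in isolation. It is not taken from
HTMLStripCharFilter, and whether a too-small read-ahead limit is what
actually triggers the error here is an assumption. The class name
MarkInvalidSketch, the helper markReadReset, and the 4-char buffer size are
made up for the illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class MarkInvalidSketch {

    /**
     * Marks the reader, reads charsToRead characters, then calls reset().
     * Returns null if reset() succeeded, otherwise the exception message.
     */
    static String markReadReset(int readAheadLimit, int charsToRead) throws IOException {
        // Tiny internal buffer (4 chars, chosen for the demo) so that the
        // reader has to refill after only a few reads.
        BufferedReader reader = new BufferedReader(new StringReader("abcdefghijklmnop"), 4);
        reader.mark(readAheadLimit);
        for (int i = 0; i < charsToRead; i++) {
            reader.read();
        }
        try {
            reader.reset();
            return null;
        } catch (IOException e) {
            // The mark was dropped because we read past the read-ahead limit.
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        // Stays within the limit: reset() is guaranteed to work.
        System.out.println("limit 8, read 3:  " + markReadReset(8, 3));
        // Reads well past the limit: reset() may fail, and does on OpenJDK.
        System.out.println("limit 2, read 10: " + markReadReset(2, 10));
    }
}
```

The BufferedReader documentation only says that reset() "may fail" once the
limit has been exceeded, so the second call is best read as a likely outcome
rather than a guaranteed one.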

> HTML stripper is splitting tokens
> ---------------------------------
>
>                 Key: SOLR-1394
>                 URL: https://issues.apache.org/jira/browse/SOLR-1394
>             Project: Solr
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.4
>            Reporter: Anders Melchiorsen
>         Attachments: SOLR-1394.patch
>
>
> I am having problems with the Solr HTML stripper.
> After some investigation, I have found the cause to be that the
> stripper is replacing the removed HTML with spaces. This obviously
> breaks when the HTML is in the middle of a word, like "G&uuml;nther".
> So, without knowing what I was doing, I hacked together a fix that
> uses offset correction instead.
> That seemed to work, except that closing tags and attributes still
> broke the positioning. With even less of a clue, I replaced read()
> with next() in the two methods handling those.
> Finally, invalid HTML also gave wrong offsets, and I fixed that by
> restoring numRead when rolling back the input stream.
> At this point I stopped trying to break it, so there may still be more
> problems. Or I might have introduced some problem on my own. Anyway, I
> have put the three patches at the bottom of this mail, in case
> somebody wants to move along with this issue.
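
The difference between the two approaches in the description above can be
sketched as follows. This is a toy model, not the code from
SOLR-1394.patch: it handles only the single entity &uuml;, and the names
OffsetCorrectionSketch, stripWithPadding, stripWithCorrection, and
correctOffset are assumptions made for the illustration. Padding keeps
offsets aligned by filling removed characters with spaces, which splits the
word. Offset correction emits only the decoded text and records how far
output positions lag behind input positions.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetCorrectionSketch {
    static final String ENTITY = "&uuml;";   // the only entity this toy handles
    static final String DECODED = "\u00fc";  // u with umlaut

    /** Padding approach: replace the entity with its character plus spaces. */
    static String stripWithPadding(String in) {
        StringBuilder pad = new StringBuilder(DECODED);
        while (pad.length() < ENTITY.length()) {
            pad.append(' ');
        }
        return in.replace(ENTITY, pad.toString());
    }

    /** Result of the offset-correction approach. */
    static final class Stripped {
        final StringBuilder text = new StringBuilder();
        // Each entry is {outputOffset, cumulativeDiff}: from outputOffset on,
        // input position = output position + cumulativeDiff.
        final List<int[]> corrections = new ArrayList<>();

        /** Maps an offset in the stripped text back to the original input. */
        int correctOffset(int outputOffset) {
            int diff = 0;
            for (int[] c : corrections) {
                if (c[0] <= outputOffset) {
                    diff = c[1];
                }
            }
            return outputOffset + diff;
        }
    }

    /** Offset-correction approach: emit decoded text, remember the shift. */
    static Stripped stripWithCorrection(String in) {
        Stripped out = new Stripped();
        int diff = 0;
        int i = 0;
        while (i < in.length()) {
            if (in.startsWith(ENTITY, i)) {
                out.text.append(DECODED);
                diff += ENTITY.length() - DECODED.length();
                out.corrections.add(new int[] {out.text.length(), diff});
                i += ENTITY.length();
            } else {
                out.text.append(in.charAt(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "G&uuml;nther";
        // Padded text contains spaces, so a whitespace tokenizer sees two tokens.
        System.out.println(stripWithPadding(input).split("\\s+").length);
        Stripped s = stripWithCorrection(input);
        // Stripped text is one word; offsets still map back into the input.
        System.out.println(s.text + " ends at input offset " + s.correctOffset(s.text.length()));
    }
}
```

With the correction list, a token's start and end offsets in the stripped
text can still be mapped back to the original markup, for example for
highlighting, without inserting characters that break the token.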

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
