[ https://issues.apache.org/jira/browse/SOLR-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753913#action_12753913 ]
Jason Rutherglen commented on SOLR-1394:
----------------------------------------

Here's the exception:

{quote}
Caused by: java.io.IOException: Mark invalid
	at java.io.BufferedReader.reset(BufferedReader.java:485)
	at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:171)
	at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:728)
	at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:742)
	at org.apache.lucene.analysis.CharReader.read(CharReader.java:51)
	at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:451)
	at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:637)
	at org.apache.lucene.analysis.standard.StandardTokenizer.next(StandardTokenizer.java:198)
	at org.apache.lucene.analysis.standard.StandardFilter.next(StandardFilter.java:84)
	at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:53)
	at org.apache.solr.analysis.EnglishPorterFilter.next(EnglishPorterFilterFactory.java:108)
	at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:245)
	at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:162)
{quote}

> HTML stripper is splitting tokens
> ---------------------------------
>
> Key: SOLR-1394
> URL: https://issues.apache.org/jira/browse/SOLR-1394
> Project: Solr
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 1.4
> Reporter: Anders Melchiorsen
> Attachments: SOLR-1394.patch
>
>
> I am having problems with the Solr HTML stripper.
>
> After some investigation, I have found the cause to be that the
> stripper is replacing the removed HTML with spaces. This obviously
> breaks when the HTML is in the middle of a word, like "G&uuml;nther".
>
> So, without knowing what I was doing, I hacked together a fix that
> uses offset correction instead.
>
> That seemed to work, except that closing tags and attributes still
> broke the positioning. With even less of a clue, I replaced read()
> with next() in the two methods handling those.
>
> Finally, invalid HTML also gave wrong offsets, and I fixed that by
> restoring numRead when rolling back the input stream.
>
> At this point I stopped trying to break it, so there may still be more
> problems. Or I might have introduced some problem on my own. Anyway, I
> have put the three patches at the bottom of this mail, in case
> somebody wants to move along with this issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
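
For anyone trying to reproduce the "Mark invalid" message from the stack trace outside Solr, here is a minimal sketch of one way java.io.BufferedReader.reset() can throw it: the reader is marked with a small read-ahead limit, then read past that limit far enough to force a buffer refill. The class name MarkInvalidDemo and the helper resetOutcome are hypothetical, and whether this is what actually happens inside HTMLStripReader.restoreState is an assumption, not something the trace confirms.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class MarkInvalidDemo {

    /**
     * Marks the reader, reads up to charsToRead characters, then calls reset().
     * Returns "ok" if reset() succeeds, otherwise the IOException message.
     * (Hypothetical helper for illustration; not part of Solr or Lucene.)
     */
    static String resetOutcome(String input, int bufferSize, int readAheadLimit, int charsToRead) {
        // A small buffer forces a refill after a few characters, which is
        // the point where the reader checks the read-ahead limit.
        try (BufferedReader reader = new BufferedReader(new StringReader(input), bufferSize)) {
            reader.mark(readAheadLimit);
            for (int i = 0; i < charsToRead; i++) {
                if (reader.read() == -1) {
                    break; // end of input
                }
            }
            reader.reset();
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Limit of 2, but 5 characters are read across a refill of the 4-char buffer:
        // the mark is dropped, so reset() fails.
        System.out.println(resetOutcome("abcdefghij", 4, 2, 5));
        // Limit of 8 covers the 5 characters read, so reset() succeeds.
        System.out.println(resetOutcome("abcdefghij", 4, 8, 5));
    }
}
```

If the stripper hits the same condition, then the read-ahead limit passed to mark() would need to cover the longest span it may roll back over, but that should be checked against the HTMLStripReader source.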