[ https://issues.apache.org/jira/browse/SOLR-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757676#action_12757676 ]
Yonik Seeley commented on SOLR-1394:
------------------------------------

What's clear is that the html stripper still has problems - less clear to me is if this patch as it currently exists is better than what's in trunk.... if people think it is, we could commit it quickly for 1.4

> HTML stripper is splitting tokens
> ---------------------------------
>
>                 Key: SOLR-1394
>                 URL: https://issues.apache.org/jira/browse/SOLR-1394
>             Project: Solr
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.4
>            Reporter: Anders Melchiorsen
>         Attachments: SOLR-1394.patch
>
>
> I am having problems with the Solr HTML stripper.
> After some investigation, I have found the cause to be that the
> stripper is replacing the removed HTML with spaces. This obviously
> breaks when the HTML is in the middle of a word, like "Günther".
> So, without knowing what I was doing, I hacked together a fix that
> uses offset correction instead.
> That seemed to work, except that closing tags and attributes still
> broke the positioning. With even less of a clue, I replaced read()
> with next() in the two methods handling those.
> Finally, invalid HTML also gave wrong offsets, and I fixed that by
> restoring numRead when rolling back the input stream.
> At this point I stopped trying to break it, so there may still be more
> problems. Or I might have introduced some problem on my own. Anyway, I
> have put the three patches at the bottom of this mail, in case
> somebody wants to move along with this issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
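
For readers following the issue description, the difference between replacing stripped markup with spaces and dropping it while recording offset corrections can be illustrated with a small sketch. This is plain Python, not Solr code and not the attached patch: the names `strip_with_spaces`, `strip_with_offsets` and `correct_offset` are hypothetical, the regex is a deliberate simplification that handles neither entities nor invalid HTML, and the input with a tag inside a word is an assumed example.

```python
import bisect
import re

# Simplification: treat anything between '<' and '>' as a tag.
TAG = re.compile(r"<[^>]*>")


def strip_with_spaces(text):
    """Replace each tag with spaces of the same length.

    Offsets stay aligned with the input, but a tag inside a word
    splits that word into several whitespace-separated tokens.
    """
    return TAG.sub(lambda m: " " * len(m.group()), text)


def strip_with_offsets(text):
    """Remove tags entirely and record how far offsets have shifted.

    Returns (stripped, corrections). Each correction is a pair
    (offset_in_stripped_text, total_characters_removed_so_far).
    """
    pieces = []
    corrections = []
    out_len = 0   # length of the stripped text produced so far
    removed = 0   # total number of characters dropped so far
    pos = 0
    for m in TAG.finditer(text):
        chunk = text[pos:m.start()]
        pieces.append(chunk)
        out_len += len(chunk)
        removed += len(m.group())
        corrections.append((out_len, removed))
        pos = m.end()
    pieces.append(text[pos:])
    return "".join(pieces), corrections


def correct_offset(corrections, offset):
    """Map an offset in the stripped text back to the original text."""
    keys = [c[0] for c in corrections]
    i = bisect.bisect_right(keys, offset) - 1
    if i < 0:
        return offset  # no tag was removed before this offset
    return offset + corrections[i][1]
```

With the assumed input `G<b>ü</b>nther`, the space-replacing variant yields three whitespace-separated tokens, while the offset-correcting variant yields the single word `Günther` and still maps each position back to where it sat in the original markup.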