[jira] Updated: (SOLR-1394) HTML stripper is splitting tokens

Anders Melchiorsen (JIRA) Sat, 29 Aug 2009 07:33:55 -0700

     [ 
https://issues.apache.org/jira/browse/SOLR-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Anders Melchiorsen updated SOLR-1394:
-------------------------------------

    Attachment: SOLR-1394.patch

Proof-of-concept fixes. I don't expect them to be perfect.

There is one test that still breaks. I think the test does not make sense with 
my changes, but I did not remove it.

Also, the new functionality (offset correction) is not tested. I don't know how 
to do it.


> HTML stripper is splitting tokens
> ---------------------------------
>
>                 Key: SOLR-1394
>                 URL: https://issues.apache.org/jira/browse/SOLR-1394
>             Project: Solr
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.4
>            Reporter: Anders Melchiorsen
>         Attachments: SOLR-1394.patch
>
>
> I am having problems with the Solr HTML stripper.
> After some investigation, I have found the cause to be that the
> stripper is replacing the removed HTML with spaces. This obviously
> breaks when the HTML is in the middle of a word, like "G&uuml;nther".
> So, without knowing what I was doing, I hacked together a fix that
> uses offset correction instead.
> That seemed to work, except that closing tags and attributes still
> broke the positioning. With even less of a clue, I replaced read()
> with next() in the two methods handling those.
> Finally, invalid HTML also gave wrong offsets, and I fixed that by
> restoring numRead when rolling back the input stream.
> At this point I stopped trying to break it, so there may still be more
> problems. Or I might have introduced some problem on my own. Anyway, I
> have put the three patches at the bottom of this mail, in case
> somebody wants to move along with this issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1394) HTML stripper is splitting tokens

Reply via email to