[ https://issues.apache.org/jira/browse/SOLR-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766727#action_12766727 ]
Yonik Seeley commented on SOLR-1394:
------------------------------------

I've been testing this with a bunch of different HTML, and I don't see any places where this is worse, and it prevents tokens from being split when they shouldn't be.
Given that the splitting is clearly a bug, and that changes to this filter won't affect the rest of Solr, I plan on committing this shortly.

Things still aren't perfect as far as offsets and highlighting, but this patch makes it no worse.
I modified the solr.xml document to escape the '&' and then added the strip char filter to the text field.

The query was: héllo OR hello OR unicode

Before this patch:
  Good <em>unicode</em> support: héllo <em>(hell</em>o with an accent over the e)

After this patch:
  Good <em>unicode</em> support: <em>héll</em>o <em>(hell</em>o with an accent over the e)

> HTML stripper is splitting tokens
> ---------------------------------
>
>                 Key: SOLR-1394
>                 URL: https://issues.apache.org/jira/browse/SOLR-1394
>             Project: Solr
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.4
>            Reporter: Anders Melchiorsen
>         Attachments: SOLR-1394.patch, SOLR-1394.patch
>
>
> The Solr HTML stripper is replacing any removed HTML with whitespace. This is
> to keep offsets correct for highlighting.
> However, as was already pointed out in SOLR-42, this means that any token
> containing an HTML entity will be split into several tokens. That makes the
> HTML stripper completely unreliable for international text (and any text is
> potentially international).
> The current code is actually deficient for BOTH highlighting and indexing,
> where the previous incarnation (that did not insert spaces) only had problems
> with highlighting.
> The only workaround is to not use entities at all, which is impossible in
> some situations and inconvenient in most situations. If the client is
> required to transform entities before handing it to Solr, it might as well be
> required to also strip tags, and then the HTML stripper would not be needed
> at all.
> Today, we have a better solution that can be used: offset correction. We can
> then avoid inserting extra whitespace, but still get correct offsets. The
> attached patch implements just that.
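
For anyone following along, here is a rough sketch of the offset-correction idea described in the issue. It is not the code from the attached patch and not Solr's HTML strip char filter; the function names (strip_markup, tokens_with_original_offsets) and the simple regex-based tag/entity matching are assumptions made purely for illustration. The point is that markup is removed without padding it with whitespace, while a per-character map back to the original text is kept so token offsets can still be reported against the raw input.

```python
import html
import re

# Illustrative only: matches a tag such as <em> or an entity such as
# &eacute; / &#233;. A real HTML stripper handles far more cases.
_MARKUP = re.compile(r"<[^>]*>|&#?\w+;")


def strip_markup(raw):
    """Return (text, spans). spans[i] is the (start, end) range in raw
    that produced text[i]. No whitespace is inserted for removed markup."""
    chars, spans = [], []
    pos = 0
    for m in _MARKUP.finditer(raw):
        # Copy the plain text before this piece of markup unchanged.
        for i in range(pos, m.start()):
            chars.append(raw[i])
            spans.append((i, i + 1))
        piece = m.group()
        if piece.startswith("&"):
            # Decode the entity; each decoded character maps back to the
            # whole entity. Unknown entities are left undecoded.
            for ch in html.unescape(piece):
                chars.append(ch)
                spans.append((m.start(), m.end()))
        # Tags contribute no characters at all.
        pos = m.end()
    for i in range(pos, len(raw)):
        chars.append(raw[i])
        spans.append((i, i + 1))
    return "".join(chars), spans


def tokens_with_original_offsets(raw):
    """Tokenize the stripped text, reporting offsets into the raw input."""
    text, spans = strip_markup(raw)
    result = []
    for m in re.finditer(r"\w+", text):
        start = spans[m.start()][0]    # corrected start offset
        end = spans[m.end() - 1][1]    # corrected end offset
        result.append((m.group(), start, end))
    return result
```

With an input like "<b>h&eacute;llo</b> world", the entity no longer splits the word: the stripped text contains the single token "héllo", and its corrected offsets cover "h&eacute;llo" in the raw input, which is what a highlighter would need.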