[ https://issues.apache.org/jira/browse/SOLR-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Rowe resolved SOLR-2328.
------------------------------
    Resolution: Won't Fix

bq. The resolution choice was not to mean that the given example is actually handled by Solr, but rather to question whether the filter should be modified to handle all kinds of invalid HTML. There are a million ways people can mis-type HTML markup and I think it would be wrong to try to guess - it would probably introduce other bugs.

+1

bq. So my reason to close this was more to question whether this exact example is such a common one that causes so much pain in real search indexes that it warrants a special fix.

I think it's reasonable to resolve it; I just think "Can't Reproduce" is the wrong resolution. I've taken the liberty of resolving as "Won't Fix". Please re-open and re-resolve if this seems inappropriate.

bq. My take is that HTMLStripCF should strip real HTML, and if anyone wants a more fuzzy filter, then the right cure is to create a new one, e.g. HTMLStripLenientCharFilter using https://jsoup.org/ or similar existing stuff.

I don't know if lenient is the correct term here. I saved {{<html><body><a href="http://www.twitter.com/ceonyc"@ceonyc</a></body></html>}} as an HTML file and Safari, Chrome and Firefox all showed nothing at all. Is it better or more lenient to exclude the possibility of tokenizing {{@ceonyc}}? I don't know. It can be difficult to know where the trash ends and the treasure begins.

> HTMLStripCharFilter Leaves Broken HTML Tags
> -------------------------------------------
>
>                 Key: SOLR-2328
>                 URL: https://issues.apache.org/jira/browse/SOLR-2328
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4.1
>            Reporter: Jeff Nadler
>
> Some kinds of 'bad' HTML are missed by HTMLStripCharFilter.
> For example, the following invalid HTML:
> <a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>
> Is filtered to:
> <a href="http://www.twitter.com/ceonyc"@ceonyc
> I understand the challenge here, without the end > it's tough to know what to
> do. It turns out that real-world web pages are full of this kind of garbage
> HTML, and browsers (impressively!) seem to handle this quite gracefully.
> Plus, users in my app can search for 'href' and find lots of matches (that
> don't appear to contain 'href') as a result.
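
For illustration only, here is a minimal Python sketch of the trade-off raised in the comment above. It is not HTMLStripCharFilter's algorithm and not what jsoup does; {{strip_tags_naive}} is a hypothetical helper that assumes one simple rule, namely that everything from a tag-opening {{<}} up to the next {{>}} is markup. Under that assumption the malformed anchor from this issue is swallowed whole, which matches the "browsers showed nothing" observation, but it also means {{@ceonyc}} is never tokenized.

```python
def strip_tags_naive(text: str) -> str:
    """Hypothetical lenient stripper: drop everything from a tag-opening '<'
    up to the next '>'. It does not understand quoted attribute values, so a
    '>' inside an attribute would end the tag early."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        # Treat '<' as the start of a tag only if a letter, '/' or '!' follows,
        # so that plain text such as "a < b" is left alone.
        starts_tag = (
            ch == "<"
            and i + 1 < n
            and (text[i + 1].isalpha() or text[i + 1] in "/!")
        )
        if starts_tag:
            close = text.find(">", i + 1)
            if close == -1:
                # Unterminated tag: this sketch discards the rest of the input.
                break
            i = close + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


well_formed = '<a href="http://www.twitter.com/ceonyc">@ceonyc</a>'
malformed = '<a href="http://www.twitter.com/ceonyc"@ceonyc</a>'

print(repr(strip_tags_naive(well_formed)))  # '@ceonyc'
print(repr(strip_tags_naive(malformed)))    # '' (the handle is lost with the tag)
```

With this rule a search for 'href' no longer matches the malformed input, but neither does a search for the handle. Which outcome is preferable is exactly the open question in the comment.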