[ 
https://issues.apache.org/jira/browse/SOLR-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe resolved SOLR-2328.
------------------------------
    Resolution: Won't Fix

bq. The resolution choice was not meant to imply that the given example is actually 
handled by Solr, but rather to question whether the filter should be modified 
to handle all kinds of invalid HTML. There are a million ways people can 
mis-type HTML markup and I think it would be wrong to try to guess - it would 
probably introduce other bugs.

+1

bq. So my reason to close this was more to question whether this exact example 
is such a common one that causes so much pain in real search indexes that it 
warrants a special fix.

I think it's reasonable to resolve it; I just think "Can't Reproduce" is the 
wrong resolution.  I've taken the liberty of resolving as "Won't Fix".  Please 
re-open and re-resolve if this seems inappropriate.

bq. My take is that HTMLStripCF should strip real HTML, and if anyone wants a 
more fuzzy filter, then the right cure is to create a new one, e.g. 
HTMLStripLenientCharFilter using https://jsoup.org/ or similar existing stuff.

I don't know if lenient is the correct term here.  I saved {{<html><body><a 
href="http://www.twitter.com/ceonyc"@ceonyc</a></body></html>}} as an HTML file 
and Safari, Chrome and Firefox all showed nothing at all.  Is it better or more 
lenient to exclude the possibility of tokenizing {{@ceonyc}}?  I don't know.  
It can be difficult to know where the trash ends and the treasure begins.
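
To make the trade-off concrete, here is a minimal sketch of two stripping policies, using only {{java.util.regex}}. It is not how HTMLStripCharFilter is implemented, and the class and method names ({{StripPolicies}}, {{stripStrict}}, {{stripToNextBracket}}) are hypothetical. The strict policy removes only well-formed tags; the other removes everything from a {{<}} up to the next {{>}}, which is roughly the reading under which {{@ceonyc}} disappears.

```java
import java.util.regex.Pattern;

public class StripPolicies {
    // Strict policy (an assumption for illustration): a tag is '<', an optional '/',
    // a name, zero or more attributes with optionally quoted values, then '>'.
    private static final Pattern WELL_FORMED_TAG = Pattern.compile(
        "</?[A-Za-z][A-Za-z0-9]*"
        + "(?:\\s+[A-Za-z-]+(?:\\s*=\\s*(?:\"[^\"]*\"|'[^']*'|[^\\s\"'>]+))?)*"
        + "\\s*/?>");

    // Permissive policy: anything from '<' up to the next '>' counts as a tag.
    private static final Pattern UP_TO_NEXT_BRACKET = Pattern.compile("<[^>]*>");

    static String stripStrict(String html) {
        return WELL_FORMED_TAG.matcher(html).replaceAll("");
    }

    static String stripToNextBracket(String html) {
        return UP_TO_NEXT_BRACKET.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // The malformed example from this issue: the opening tag has no closing '>'.
        String broken = "<a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>";

        // Strict: only </a> is removed, so the broken opening tag survives.
        System.out.println(stripStrict(broken));
        // prints: <a href="http://www.twitter.com/ceonyc"@ceonyc

        // Permissive: the whole input is one "tag", so @ceonyc is lost as well.
        System.out.println(stripToNextBracket(broken).isEmpty());
        // prints: true
    }
}
```

Neither result is obviously right: the strict policy leaves {{href}} searchable, and the permissive one throws away the only text a user would want to find.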


> HTMLStripCharFilter Leaves Broken HTML Tags
> -------------------------------------------
>
>                 Key: SOLR-2328
>                 URL: https://issues.apache.org/jira/browse/SOLR-2328
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4.1
>            Reporter: Jeff Nadler
>
> Some kinds of 'bad' HTML are missed by HTMLStripCharFilter.   For example, 
> the following invalid HTML:
>      <a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>
> Is filtered to:
>      <a href="http://www.twitter.com/ceonyc"@ceonyc
> I understand the challenge here: without the closing > it's tough to know 
> what to do.  It turns out that real-world web pages are full of this kind of 
> garbage HTML, and browsers (impressively!) seem to handle it quite 
> gracefully.  Plus, users in my app can search for 'href' and find lots of 
> matches (that don't appear to contain 'href') as a result.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
