buffalob wrote:
> Because you described that fix as "quick/dirty", I'm also wondering if
> there's any broader solution that I should consider for the longer
> term to help lessen the chance of such invalid data grinding my app to
> a halt? Any thoughts?

Quick & dirty insofar as your HTML_Stripper class does not handle 
named entities such as "&nbsp;", or numeric entities above 255 such as 
"&#8211;". It will not halt on them, but simply swallow them. You may 
want to improve that by adding convert_charref() and convert_entityref() 
methods to that class which handle these cases as well, and which convert 
cp1252 chars such as "&#151;" to their proper Unicode equivalents.
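
A rough sketch of that idea, for illustration only. It assumes Python 3, where sgmllib no longer exists, so it builds on html.parser.HTMLParser and its handle_entityref()/handle_charref() hooks rather than on convert_entityref()/convert_charref(). The names TextExtractor, charref_to_text and strip_tags are made up here and are not taken from your code.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


def charref_to_text(number):
    """Map a numeric reference to text, treating 128-159 as cp1252 bytes."""
    if 128 <= number <= 159:
        try:
            return bytes([number]).decode("cp1252")
        except UnicodeDecodeError:
            pass  # a few bytes (e.g. 129) are undefined in cp1252
    return chr(number)


class TextExtractor(HTMLParser):
    """Hypothetical stand-in for HTML_Stripper: keeps text, drops tags."""

    def __init__(self):
        # convert_charrefs=False so references reach the handlers below
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # Named entity such as &nbsp;; unknown names are kept verbatim
        codepoint = name2codepoint.get(name)
        self.parts.append(chr(codepoint) if codepoint is not None else "&%s;" % name)

    def handle_charref(self, name):
        # Numeric reference, decimal (&#8211;) or hexadecimal (&#x2013;)
        try:
            number = int(name[1:], 16) if name[:1] in ("x", "X") else int(name)
            self.parts.append(charref_to_text(number))
        except (ValueError, OverflowError):
            self.parts.append("&#%s;" % name)

    def text(self):
        return "".join(self.parts)


def strip_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

With this, strip_tags("2006&#8211;2007") keeps the en dash instead of dropping it, and a cp1252-style reference such as &#151; comes out as a real em dash.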

The other problem is that the above only applies to "double-encoded" 
characters, such as

    &amp;#8211;

If the feed has them encoded as part of the RSS (XML) file, i.e.

    &#8211;

then they will be swallowed anyway, because your code calls 
encode('ascii','ignore'), which silently drops every non-ASCII 
character. You should use 'utf-8' instead of 'ascii' here.
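
To make the difference concrete, here is a small sketch, assuming Python 3 and its html.unescape() helper (your code will look different; the sample string is invented). It shows that a double-encoded reference needs two rounds of unescaping, and what the two encode() variants do to the resulting en dash.

```python
import html

double_encoded = "2006&amp;#8211;2007"

# First pass only turns &amp; back into &, leaving a plain numeric reference
once = html.unescape(double_encoded)

# Second pass resolves the numeric reference to the actual character
twice = html.unescape(once)

# 'ignore' silently drops the en dash; utf-8 keeps it as three bytes
ascii_bytes = twice.encode("ascii", "ignore")
utf8_bytes = twice.encode("utf-8")

print(repr(once), repr(twice), ascii_bytes, utf8_bytes)
```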

Yet another point is that you should catch SGMLParseError.
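
Something along these lines, as a sketch only. strip_or_fallback and its stripper argument are hypothetical names; stripper stands for whatever function runs your HTML_Stripper. The ImportError branch is there only so the snippet also runs on Python versions that no longer ship sgmllib.

```python
try:
    from sgmllib import SGMLParseError  # Python 2 only
except ImportError:
    class SGMLParseError(Exception):
        """Stand-in so the sketch also runs where sgmllib is unavailable."""


def strip_or_fallback(markup, stripper, fallback=""):
    """Run stripper(markup); return fallback if the parser gives up."""
    try:
        return stripper(markup)
    except SGMLParseError:
        # One malformed feed entry should not take the whole request down
        return fallback
```

That way a single broken entry degrades to the fallback text while the rest of the feed is still processed.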

You can also experiment with htmllib.HTMLParser or HTMLParser.HTMLParser 
instead of sgmllib.SGMLParser as the base class.

-- Chris
