buffalob wrote:
> Because you described that fix as "quick/dirty", I'm also wondering if
> there's any broader solution that I should consider for the longer
> term to help lessen the chance of such invalid data griding my app to
> a halt? Any thoughts?
Quick & dirty insofar as your HTML_Stripper class does not handle
named entities such as "&nbsp;", or numeric entities above 255, such as
"&#8211;". It will not halt, but will simply swallow them. You may want to
improve that by adding convert_charref() and convert_entityref() methods to
that class to handle these cases as well, and to convert cp1252 chars such
as "&#151;" to their proper Unicode equivalents.
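
A rough sketch of that idea, not a drop-in patch for your class: it is
written for Python 3, where sgmllib no longer exists, so it uses
html.parser.HTMLParser, and handle_charref()/handle_entityref() stand in
for convert_charref()/convert_entityref(). The names TextExtractor and
strip_html are made up for illustration.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, decoding entities instead of dropping them."""

    def __init__(self):
        # convert_charrefs=False so the handle_*ref hooks below get called.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, name):
        # name arrives without "&#" and ";", e.g. "8211" or "x2013".
        try:
            code = int(name[1:], 16) if name[0] in "xX" else int(name)
        except ValueError:
            return
        if 128 <= code <= 159:
            # Feeds often use cp1252 byte values here (e.g. 151 for an
            # em dash); map them to the characters that were intended.
            try:
                self.parts.append(bytes([code]).decode("cp1252"))
            except UnicodeDecodeError:
                pass  # byte value is undefined in cp1252
            return
        try:
            self.parts.append(chr(code))
        except (ValueError, OverflowError):
            pass  # not a valid code point

    def handle_entityref(self, name):
        code = name2codepoint.get(name)
        if code is not None:
            self.parts.append(chr(code))
        else:
            # Unknown entity: keep it visible rather than swallow it.
            self.parts.append("&%s;" % name)


def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)
```

With this, strip_html("<p>a&nbsp;b &#8211; c &#151; d</p>") keeps the
no-break space and both dashes instead of losing them.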
The other problem is that the above only applies to the "double-encoded"
characters, such as
<decode>&#8211;</decode>
If the feed has them encoded as part of the RSS (XML) file, i.e.
<decode>–</decode>
then they will be swallowed anyway because you encode('ascii','ignore')
in your code. You should use 'utf-8' instead of 'ascii' here.
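
To illustrate the difference (Python 3 syntax, with a made-up sample
string; in your code the value would be a unicode object):

```python
text = "1999\u20132004"  # contains an en dash (U+2013)

# With 'ignore', the ASCII codec silently drops anything it cannot encode.
ascii_bytes = text.encode("ascii", "ignore")

# UTF-8 can represent every Unicode character, so nothing is lost.
utf8_bytes = text.encode("utf-8")

print(ascii_bytes)  # the dash is gone
print(utf8_bytes)   # the dash survives as three bytes
```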
Yet another point is that you should catch SGMLParseError.
You can also experiment with htmllib.HTMLParser or HTMLParser.HTMLParser
instead of SGMLParser.
-- Chris